SPH_MATCH_EXTENDED doesnt work correctly with international char
Goodwill
Name: Goodwill Posts: 5 |
2007-12-16 23:11:37
Hello,
I am getting different search results from match_any and match_extended searches:
match_any returns results correctly, but the extended search returns nothing for
international characters, e.g. @texts Iñtërnâtiônàlizætiøn. The data in the database is
stored as utf8_general_ci, and charset_type for the index in sphinx.conf is set to utf-8.
For example, this works:
@texts aaa|bbb|test
and this doesn't:
@texts aaa|bbb|Iñtërnâtiônàlizætiøn|test
(Sphinx returns nothing in this case, not even for aaa, bbb or test.)
match_any mode works with international characters without problems. Any thoughts?
|
|
Goodwill
Name: Goodwill Posts: 5 |
to: Goodwill, 2007-12-16 23:16:06
I forgot to mention that I am using Sphinx 0.9.7 via the PHP API.
|
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: Goodwill, 2007-12-17 00:19:46
> match_any mode works with international characters without problems.
What's with 0.9.8 (there's been a number of fixes compared to 0.9.7)?
What's in charset_table and ngram_chars?
What's returned in $result["words"] section if using extended mode?
|
 |
|
Goodwill
Name: Goodwill Posts: 5 |
to: shodan, 2007-12-17 12:01:17
> > match_any mode works with international characters without problems.
>
> What's with 0.9.8 (there's been a number of fixes compared to 0.9.7)?
>
> What's in charset_table and ngram_chars?
>
> What's returned in $result["words"] section if using extended mode?
I switched over to Sphinx 0.9.8 yesterday, along with the new PHP API, and got the same
results. The match_any result for Iñtërnâtiônàlizætiøn looks like this:
Array
(
    [error] =>
    [warning] =>
    [status] => 0
    [fields] => Array
        (
            [0] => cache_id
            [1] => title
            [2] => description
            [3] => tags
            [4] => username
            [5] => texts
        )
    [attrs] => Array
        (
            [added_date] => 2
        )
    [matches] => Array
        (
            [30050] => Array
                (
                    [weight] => 3
                    [attrs] => Array
                        (
                            [added_date] => 0
                        )
                )
        )
    [total] => 1
    [total_found] => 1
    [time] => 0.051
    [words] => Array
        (
            [liz] => Array
                (
                    [docs] => 1
                    [hits] => 4
                )
        )
)
match_extended returns the following for the same word, @text Iñtërnâtiônàlizætiøn:
Array
(
    [error] =>
    [warning] =>
    [status] => 0
    [fields] => Array
        (
            [0] => cache_id
            [1] => title
            [2] => description
            [3] => tags
            [4] => username
            [5] => texts
        )
    [attrs] => Array
        (
            [added_date] => 2
        )
    [total] => 0
    [total_found] => 0
    [time] => 0.000
    [words] => Array
        (
            [i] => Array
                (
                    [docs] => 0
                    [hits] => 0
                )
            [t] => Array
                (
                    [docs] => 0
                    [hits] => 0
                )
            [rn] => Array
                (
                    [docs] => 0
                    [hits] => 0
                )
            [ti] => Array
                (
                    [docs] => 0
                    [hits] => 0
                )
            [n] => Array
                (
                    [docs] => 0
                    [hits] => 0
                )
            [liz] => Array
                (
                    [docs] => 1
                    [hits] => 4
                )
        )
)
I didn't set any charset_table values, nor ngram_len or ngram_chars. The data is
utf8_general_ci in the database, and charset_type is set to utf-8. I guess the problem may
be in the charset table, but I don't know how to set it up so that I get results for
documents that could contain a wide range of international characters.
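Editorial sketch: the keyword list in the [words] section above is what you would expect if every character outside the charset table acts as a word separator. The Python below is only an illustration, not Sphinx's tokenizer. The regular expression is an assumed approximation of the default UTF-8 table (digits, underscore, ASCII letters, basic Russian letters), and `split_like_default_table` is a hypothetical helper name.

```python
import re

# Assumed approximation of the default UTF-8 charset table:
# digits, underscore, ASCII letters and basic Russian letters.
DEFAULT_WORD_CHARS = re.compile(r"[0-9A-Za-z_\u0410-\u044f]+")

# "Iñtërnâtiônàlizætiøn", written with escapes so the code points are unambiguous.
WORD = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"


def split_like_default_table(text):
    """Split on every character outside the assumed table, then lowercase."""
    return [piece.lower() for piece in DEFAULT_WORD_CHARS.findall(text)]


print(split_like_default_table(WORD))
```

The distinct fragments this produces (i, t, rn, ti, n, liz) are the same keys that appear in the extended-mode [words] array, which suggests the accented letters were dropped as separators rather than indexed.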
|
 |
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: Goodwill, 2007-12-17 12:29:52
> I guess the problem may be in the charset table,
Yes, this must be the issue. Default table only indexes English and Russian chars.
The simplest table addition that covers the whole Unicode range is U+80..U+2FFFF, and you
could use it for testing, but it has the following problems:
1) it includes both letters and non-letters,
2) it does *not* do any case folding.
Here's a table I once automatically built from Unicode character list. I have no idea
whether it's correct for any other languages but English and Russian. Also note that it
does not include anything but letters (no 0-9 numbers, no underscores, etc.)
charset_table = \
U+41...U+5a->U+61...U+7a, U+61...U+7a, \
U+aa, U+b5, U+ba, U+c0...U+d6->U+e0...U+f6, U+d8...U+de->U+f8...U+fe, U+df...U+f6,
U+f8...U+ff, \
U+100...U+12f/2, U+130->U+69, U+131, U+132...U+137/2, U+138, U+139...U+148/2, \
U+149, U+14a...U+177/2, U+178->U+ff, U+179...U+17e/2, U+17f...U+180, U+181->U+253, \
U+182...U+185/2, U+186->U+254, U+187...U+188/2, U+189...U+18a->U+256...U+257, \
U+18b...U+18c/2, U+18d, U+18e->U+1dd, U+18f->U+259, U+190->U+25b, U+191...U+192/2, \
U+193->U+260, U+194->U+263, U+195, U+196->U+269, U+197->U+268, U+198...U+199/2, \
U+19a...U+19b, U+19c->U+26f, U+19d->U+272, U+19e, U+19f->U+275, U+1a0...U+1a5/2, \
U+1a6->U+280, U+1a7...U+1a8/2, U+1a9->U+283, U+1aa...U+1ab, U+1ac...U+1ad/2, \
U+1ae->U+288, U+1af...U+1b0/2, U+1b1...U+1b2->U+28a...U+28b, U+1b3...U+1b6/2, \
U+1b7->U+292, U+1b8...U+1b9/2, U+1ba...U+1bb, U+1bc...U+1bd/2, U+1be...U+1c3, \
U+1c4->U+1c6, U+1c5...U+1c6/2, U+1c7->U+1c9, U+1c8...U+1c9/2, U+1ca->U+1cc, \
U+1cb...U+1dc/2, U+1dd, U+1de...U+1ef/2, U+1f0, U+1f1->U+1f3, U+1f2...U+1f5/2, \
U+1f6->U+195, U+1f7->U+1bf, U+1f8...U+21f/2, U+220->U+19e, U+221, U+222...U+233/2, \
U+234...U+23a, U+23b...U+23c/2, U+23d->U+19a, U+23e...U+240, U+241->U+294, \
U+250...U+2c1, U+2c6...U+2d1, U+2e0...U+2e4, U+2ee, \
U+1d00...U+1dbf, U+1e00...U+1e95/2, U+1e96...U+1e9b, U+1ea0...U+1ef9/2, \
U+37a, U+386...U+389->U+3ac...U+3af, U+38c...U+38e->U+3cc...U+3ce, U+390, \
U+391...U+3a1->U+3b1...U+3c1, U+3a3...U+3ab->U+3c3...U+3cb, U+3ac...U+3ce, \
U+3d0...U+3d7, U+3d8...U+3ef/2, U+3f0...U+3f3, U+3f4->U+3b8, U+3f5, \
U+3f7...U+3f8/2, U+3f9->U+3f2, U+3fa...U+3fb/2, U+3fc...U+3ff, \
U+400...U+40f->U+450...U+45f, U+410...U+42f->U+430...U+44f, \
U+430...U+45f, U+460...U+481/2, U+48a...U+4bf/2, U+4c0, \
U+4c1...U+4ce/2, U+4d0...U+4f9/2, U+500...U+50f/2, \
U+531...U+556->U+561...U+586, U+559, U+561...U+587, \
U+5d0...U+5ea, U+5f0...U+5f2, \
U+621...U+63a, U+640...U+64a, U+66e...U+66f, U+671...U+6d3, U+6d5, \
U+6e5...U+6e6, U+6ee...U+6ef, U+6fa...U+6fc, U+6ff, \
U+e01...U+e30, U+e32...U+e33, U+e40...U+e46, \
U+e81...U+e82, U+e84, U+e87...U+e88, U+e8a, U+e8d, U+e94...U+e97, U+e99...U+e9f, \
U+ea1...U+ea3, U+ea5, U+ea7, U+eaa...U+eab, U+ead...U+eb0, U+eb2...U+eb3, \
U+ebd, U+ec0...U+ec4, U+ec6, U+edc...U+edd, \
U+1000...U+1021, U+1023...U+1027, U+1029...U+102a, U+1050...U+1055, \
U+10a0...U+10c5->U+2d00...U+2d25, U+10d0...U+10fa, U+10fc, U+2d00...U+2d25, \
U+3005...U+3006, U+3031...U+3035, U+303b...U+303c, U+3041...U+3096, \
U+309d...U+309f, U+30a1...U+30fa, U+30fc...U+30ff, U+31f0...U+31ff
Here's the same table but in human-readable form. You could perhaps pull required pieces
from this one.
// latin classic
U+41..U+5a->U+61..U+7a, U+61..U+7a,
// latin accents
U+aa, U+b5, U+ba, U+c0..U+d6->U+e0..U+f6, U+d8..U+de->U+f8..U+fe, U+df..U+f6, U+f8..U+ff,
// more latin accents
U+100..U+12f/2, U+130->U+69, U+131, U+132..U+137/2, U+138, U+139..U+148/2,
U+149, U+14a..U+177/2, U+178->U+ff, U+179..U+17e/2, U+17f..U+180, U+181->U+253,
U+182..U+185/2, U+186->U+254, U+187..U+188/2, U+189..U+18a->U+256..U+257,
U+18b..U+18c/2, U+18d, U+18e->U+1dd, U+18f->U+259, U+190->U+25b, U+191..U+192/2,
U+193->U+260, U+194->U+263, U+195, U+196->U+269, U+197->U+268, U+198..U+199/2,
U+19a..U+19b, U+19c->U+26f, U+19d->U+272, U+19e, U+19f->U+275, U+1a0..U+1a5/2,
U+1a6->U+280, U+1a7..U+1a8/2, U+1a9->U+283, U+1aa..U+1ab, U+1ac..U+1ad/2,
U+1ae->U+288, U+1af..U+1b0/2, U+1b1..U+1b2->U+28a..U+28b, U+1b3..U+1b6/2,
U+1b7->U+292, U+1b8..U+1b9/2, U+1ba..U+1bb, U+1bc..U+1bd/2, U+1be..U+1c3,
U+1c4->U+1c6, U+1c5..U+1c6/2, U+1c7->U+1c9, U+1c8..U+1c9/2, U+1ca->U+1cc,
U+1cb..U+1dc/2, U+1dd, U+1de..U+1ef/2, U+1f0, U+1f1->U+1f3, U+1f2..U+1f5/2,
U+1f6->U+195, U+1f7->U+1bf, U+1f8..U+21f/2, U+220->U+19e, U+221, U+222..U+233/2,
U+234..U+23a, U+23b..U+23c/2, U+23d->U+19a, U+23e..U+240, U+241->U+294,
U+250..U+2c1, U+2c6..U+2d1, U+2e0..U+2e4, U+2ee,
// even more latin accents
U+1d00..U+1dbf, U+1e00..U+1e95/2, U+1e96..U+1e9b, U+1ea0..U+1ef9/2,
// greek
U+37a, U+386..U+389->U+3ac..U+3af, U+38c..U+38e->U+3cc..U+3ce, U+390,
U+391..U+3a1->U+3b1..U+3c1, U+3a3..U+3ab->U+3c3..U+3cb, U+3ac..U+3ce,
U+3d0..U+3d7, U+3d8..U+3ef/2, U+3f0..U+3f3, U+3f4->U+3b8, U+3f5,
U+3f7..U+3f8/2, U+3f9->U+3f2, U+3fa..U+3fb/2, U+3fc..U+3ff,
// cyrillic
U+400..U+40f->U+450..U+45f, U+410..U+42f->U+430..U+44f,
U+430..U+45f, U+460..U+481/2, U+48a..U+4bf/2, U+4c0,
U+4c1..U+4ce/2, U+4d0..U+4f9/2, U+500..U+50f/2,
// armenian
U+531..U+556->U+561..U+586, U+559, U+561..U+587,
// hebrew
U+5d0..U+5ea, U+5f0..U+5f2,
// arabic
U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5,
U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff,
// thai
U+e01..U+e30, U+e32..U+e33, U+e40..U+e46,
// lao
U+e81..U+e82, U+e84, U+e87..U+e88, U+e8a, U+e8d, U+e94..U+e97, U+e99..U+e9f,
U+ea1..U+ea3, U+ea5, U+ea7, U+eaa..U+eab, U+ead..U+eb0, U+eb2..U+eb3,
U+ebd, U+ec0..U+ec4, U+ec6, U+edc..U+edd,
// myanmar
U+1000..U+1021, U+1023..U+1027, U+1029..U+102a, U+1050..U+1055,
// georgian
U+10a0..U+10c5->U+2d00..U+2d25, U+10d0..U+10fa, U+10fc, U+2d00..U+2d25,
// katakana/hiragana
U+3005..U+3006, U+3031..U+3035, U+303b..U+303c, U+3041..U+3096,
U+309d..U+309f, U+30a1..U+30fa, U+30fc..U+30ff, U+31f0..U+31ff
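Editorial sketch: to check what a table like the one above actually does to a given character, it can help to expand the rules into an explicit code point map. The Python below is a minimal illustration, not Sphinx code. It assumes the documented rule forms (single character, `A..B` range, `A->a` and `A..Z->a..z` mappings, and the `A..B/2` checkerboard that maps each pair of characters to the second one). The `//` comment handling exists only so the human-readable listing can be pasted in, and `expand_charset_table` is a hypothetical helper name.

```python
import re


def _cp(token):
    """Convert 'U+41' (or a single literal character) to a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError("bad character token: %r" % token)
    return ord(token)


def _range(spec):
    """Parse 'A' or 'A..B' into an inclusive (start, end) pair."""
    parts = spec.split("..")
    if len(parts) > 2:
        raise ValueError("bad range: %r" % spec)
    start = _cp(parts[0])
    end = _cp(parts[-1])
    if end < start:
        raise ValueError("descending range: %r" % spec)
    return start, end


def expand_charset_table(table):
    """Return {source code point: folded code point} for a charset_table string."""
    mapping = {}
    # Drop '//' comments and line-continuation backslashes.
    cleaned = re.sub(r"//[^\n]*", "", table).replace("\\", " ")
    for rule in cleaned.split(","):
        rule = rule.strip()
        if not rule:
            continue
        if rule.endswith("/2"):
            # Checkerboard: each pair (x, x+1) maps to the second character.
            start, end = _range(rule[:-2])
            if (end - start) % 2 == 0:
                raise ValueError("checkerboard range needs an even count: %r" % rule)
            for cp in range(start, end + 1, 2):
                mapping[cp] = cp + 1
                mapping[cp + 1] = cp + 1
        elif "->" in rule:
            src, dst = rule.split("->")
            s0, s1 = _range(src)
            d0, d1 = _range(dst)
            if s1 - s0 != d1 - d0:
                raise ValueError("range lengths differ: %r" % rule)
            for offset in range(s1 - s0 + 1):
                mapping[s0 + offset] = d0 + offset
        else:
            # Plain character or range: indexed as-is.
            start, end = _range(rule)
            for cp in range(start, end + 1):
                mapping[cp] = cp
    return mapping


table = expand_charset_table("U+c0..U+d6->U+e0..U+f6, U+df..U+f6")
print(hex(table[0xC9]))  # folded form of U+C9 under this fragment
```

Any code point missing from the resulting map is one the table does not index. Because ranges are split on exactly two dots, this sketch raises `ValueError` for a three-dot range such as `U+41...U+5a`.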
|
 |
|
Goodwill
Name: Goodwill Posts: 5 |
to: shodan, 2007-12-18 00:51:38
Thank you very much for the conversion table. I was playing with it a bit today and it
started to work on my Linux install, though the Windows install still needs some tweaks to
work correctly. It probably uses a different MySQL charset, even though both installs are
utf8-based. Is there any suggested utf8 charset for connection and database or it
doesn't matter what kind of utf collation/connection charset I use? This part isn't clear
to me yet.
|
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: Goodwill, 2007-12-20 02:22:57
> Is there any suggested utf8 charset for connection and database or it doesn't matter what
> kind of utf collation/connection charset I use?
The collation should not matter; the charset does. Normally you'd override per-connection
charset using
sql_query_pre = SET NAMES utf8
just to be on the safe side, and that's it.
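Editorial sketch: the reason the connection charset matters is that it decides which bytes the indexer receives, whereas a collation only affects comparison and sorting inside MySQL. The Python below is a small illustration that does not talk to MySQL at all. It just shows that the same text has different byte sequences in UTF-8 and Latin-1, and that Latin-1 bytes are not valid UTF-8. `is_valid_utf8` is a hypothetical helper name.

```python
# "Iñtërn", written with escapes so the code points are unambiguous.
text = "I\u00f1t\u00ebrn"

utf8_bytes = text.encode("utf-8")      # what a utf8 connection would deliver
latin1_bytes = text.encode("latin-1")  # what a latin1 connection would deliver


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(latin1_bytes, is_valid_utf8(latin1_bytes))
```

An index declared as utf-8 that is fed the second byte sequence cannot recover the accented letters, whatever the charset_table says, which is consistent with the advice to force the connection charset in sql_query_pre.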
|
 |
|
erelsgl
Name: Erel Posts: 14 |
to: shodan, 2008-03-26 13:54:23
> Here's a table I once automatically built from Unicode character list. I have no idea
> whether it's correct for any other languages but English and Russian. Also note that it
> does not include anything but letters (no 0-9 numbers, no underscores, etc.)
>
> charset_table = \
> U+41...U+5a->U+61...U+7a, U+61...U+7a, \
> U+aa, U+b5, U+ba, U+c0...U+d6->U+e0...U+f6, U+d8...U+de->U+f8...U+fe, U+df...U+f6,
> U+f8...U+ff, \
> U+100...U+12f/2, U+130->U+69, U+131, U+132...U+137/2, U+138, U+139...U+148/2,
Note that ranges in charset_table should be written with two dots ("..") and not three
("...").
|
 |
|
bfarber
Name: Brandon Farber Posts: 11 |
to: shodan, 2009-01-30 04:15:56
> > I guess the problem may be in the charset table,
>
> Here's a table I once automatically built from Unicode character list. I have no idea
> whether it's correct for any other languages but English and Russian. Also note that it
> does not include anything but letters (no 0-9 numbers, no underscores, etc.)
>
> charset_table = \
> ....................................
> // greek
> U+37a, U+386..U+389->U+3ac..U+3af, U+38c..U+38e->U+3cc..U+3ce, U+390,
> U+391..U+3a1->U+3b1..U+3c1, U+3a3..U+3ab->U+3c3..U+3cb, U+3ac..U+3ce,
> U+3d0..U+3d7, U+3d8..U+3ef/2, U+3f0..U+3f3, U+3f4->U+3b8, U+3f5,
> U+3f7..U+3f8/2, U+3f9->U+3f2, U+3fa..U+3fb/2, U+3fc..U+3ff,
I'm working on a Greek site that we converted from iso-8859-7 (see this topic if you use
8859-7: http://sphinxsearch.com/forum/view.html?id=364 ) and had to fiddle with the
charset table quite a bit. A lot of the Greek characters in that block are no longer used
in the language, but any character with tonos or dialytika needs to be folded into its
non-accented version, and then of course case folding applies on top of that.
This is a long-winded version of the Greek table. You might be able to condense it using
the .. and /2 shortcuts, but it seems to work in my testing so far.
It might help others, so I decided to post it.
U+370->U+371, U+371, \
U+372->U+373, U+373, \
U+374->U+375, U+375, \
U+376->U+377, U+377, \
U+37a, \
U+3fd->U+37b, U+37b, \
U+3fe->U+37c, U+37c, \
U+3ff->U+37d, U+37d, \
U+37e, \
U+386->U+3b1, \
U+388->U+3b5, \
U+389->U+3b7, \
U+38a->U+3b9, \
U+38c->U+3bf, \
U+38e->U+3c5, \
U+38f->U+3c9, \
U+390->U+3b9, \
U+3aa->U+3b9, \
U+3ab->U+3c5, \
U+3ac->U+3b1, \
U+3ad->U+3b5, \
U+3ae->U+3b7, \
U+3af->U+3b9, \
U+3b0->U+3c5, \
U+3ca->U+3b9, \
U+3cb->U+3c5, \
U+3cc->U+3bf, \
U+3cd->U+3c5, \
U+3ce->U+3c9, \
U+3cf->U+3d7, U+3d7, \
U+3d0->U+3b2, \
U+3d1->U+3b8, \
U+3d2->U+3c5, \
U+3d3->U+3c5, \
U+3d4->U+3c5, \
U+3d5->U+3c6, \
U+3d6->U+3c0, \
U+3d8->U+3d9, U+3d9, \
U+3da->U+3db, U+3db, \
U+3dc->U+3dd, U+3dd, \
U+3de->U+3df, U+3df, \
U+3e0->U+3e1, U+3e1, \
U+391..U+3a1->U+3b1..U+3c1, U+3b1..U+3c1, \
U+3a3..U+3a9->U+3c3..U+3c9, U+3c3..U+3c9
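Editorial sketch: accent-plus-case folding rules like the ones in this table can be derived mechanically from Unicode data instead of being typed by hand. The Python below is a rough illustration using the standard `unicodedata` module. It strips combining marks after canonical decomposition and then lowercases. `fold` and `rule_for` are hypothetical helper names, and the output is not guaranteed to match the posted table for every code point (symbol variants such as U+3d0..U+3d6 have no canonical decomposition to the plain letters).

```python
import unicodedata


def fold(ch):
    """Strip combining marks (tonos, dialytika, ...) and lowercase the result."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


def rule_for(cp):
    """Build a charset_table rule string for one code point, or None if unsuitable."""
    folded = fold(chr(cp))
    if len(folded) != 1:
        return None  # folding did not yield a single character
    target = ord(folded)
    if target == cp:
        return "U+%x" % cp
    return "U+%x->U+%x" % (cp, target)


# Accented Greek letters from the table above.
for cp in (0x386, 0x388, 0x390, 0x3AA, 0x3AB, 0x3CE):
    print(rule_for(cp))
```

Generated rules would still need to be reviewed against real queries for the language in question before going into sphinx.conf.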
|
 |