SPH_MATCH_EXTENDED doesn't work correctly with international characters


Goodwill

Name: Goodwill
Posts: 5

2007-12-16 23:11:37


Hello,
I am getting different results from match_any and match_extended searches: match_any
returns results correctly, but the extended search returns nothing for queries with
international characters, e.g. @texts Iñtërnâtiônàlizætiøn. The data is stored in the
database as utf8_general_ci, and charset_type for the index in sphinx.conf is set to utf-8.

For example,
this works: @texts aaa|bbb|test
and this doesn't: @texts aaa|bbb|Iñtërnâtiônàlizætiøn|test (Sphinx returns nothing in
this case, not even for aaa, bbb or test)

match_any mode works with international characters without problems. Any thoughts?

Goodwill

Name: Goodwill
Posts: 5

to: Goodwill, 2007-12-16 23:16:06


I forgot to mention that I am using Sphinx 0.9.7 via the PHP API.

shodan

Name: Andrew Aksyonoff
Posts: 4117

to: Goodwill, 2007-12-17 00:19:46


> match_any mode works with international characters without problems.

What's with 0.9.8 (there's been a number of fixes compared to 0.9.7)?

What's in charset_table and ngram_chars?

What's returned in $result["words"] section if using extended mode?

Goodwill

Name: Goodwill
Posts: 5

to: shodan, 2007-12-17 12:01:17


> > match_any mode works with international characters without problems.
>
> What's with 0.9.8 (there's been a number of fixes compared to 0.9.7)?
>
> What's in charset_table and ngram_chars?
>
> What's returned in $result["words"] section if using extended mode?

I switched over to Sphinx 0.9.8 yesterday, along with the new PHP API, and got the same
results. The match_any result for Iñtërnâtiônàlizætiøn looks like this:

Array
(
        [error] =>
        [warning] =>
        [status] => 0
        [fields] => Array
                (
                        [0] => cache_id
                        [1] => title
                        [2] => description
                        [3] => tags
                        [4] => username
                        [5] => texts
                )

        [attrs] => Array
                (
                        [added_date] => 2
                )

        [matches] => Array
                (
                        [30050] => Array
                                (
                                [weight] => 3
                                [attrs] => Array
                                (
                                [added_date] => 0
                                )

                                )

                )

        [total] => 1
        [total_found] => 1
        [time] => 0.051
        [words] => Array
                (
                        [liz] => Array
                                (
                                [docs] => 1
                                [hits] => 4
                                )

                )

)


match_extended returns the following for the same word, @texts Iñtërnâtiônàlizætiøn:

Array
(
        [error] =>
        [warning] =>
        [status] => 0
        [fields] => Array
                (
                        [0] => cache_id
                        [1] => title
                        [2] => description
                        [3] => tags
                        [4] => username
                        [5] => texts
                )

        [attrs] => Array
                (
                        [added_date] => 2
                )

        [total] => 0
        [total_found] => 0
        [time] => 0.000
        [words] => Array
                (
                        [i] => Array
                                (
                                [docs] => 0
                                [hits] => 0
                                )

                        [t] => Array
                                (
                                [docs] => 0
                                [hits] => 0
                                )

                        [rn] => Array
                                (
                                [docs] => 0
                                [hits] => 0
                                )

                        [ti] => Array
                                (
                                [docs] => 0
                                [hits] => 0
                                )

                        [n] => Array
                                (
                                [docs] => 0
                                [hits] => 0
                                )

                        [liz] => Array
                                (
                                [docs] => 1
                                [hits] => 4
                                )

                )

)

I didn't set charset_table, ngram_len or ngram_chars. The data is utf8_general_ci in the
database, and charset_type is set to utf-8. I guess the problem may be in the charset table,
but I don't know how to use it to get results for documents that could contain a wide
range of international characters.

shodan

Name: Andrew Aksyonoff
Posts: 4117

to: Goodwill, 2007-12-17 12:29:52


> I guess the problem may be in the charset table,

Yes, this must be the issue. Default table only indexes English and Russian chars.
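
To illustrate why the extended query above ends up with keywords such as i, t, rn and liz,
here is a minimal Python sketch of splitting text on characters that are not in the
charset table. It assumes the default utf-8 table is roughly 0..9, A..Z->a..z, _, a..z
plus the Russian letters U+410..U+42F->U+430..U+44F, U+430..U+44F. The names
build_default_table and tokenize_default are made up for this example, and this is only
an approximation of the idea, not how Sphinx is implemented internally.

```python
def build_default_table():
    """Approximate default table (an assumption): digits, underscore,
    English letters and Russian letters, with upper case folded to lower."""
    table = {}
    for cp in range(ord("0"), ord("9") + 1):
        table[cp] = cp
    table[ord("_")] = ord("_")
    for cp in range(ord("a"), ord("z") + 1):
        table[cp] = cp          # a..z
        table[cp - 32] = cp     # A..Z -> a..z
    for cp in range(0x430, 0x450):
        table[cp] = cp          # U+430..U+44F
        table[cp - 0x20] = cp   # U+410..U+42F -> U+430..U+44F
    return table


def tokenize_default(text, table=None):
    """Split text into keywords; any character missing from the table is a separator."""
    table = table or build_default_table()
    tokens, current = [], []
    for ch in text:
        folded = table.get(ord(ch))
        if folded is None:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(chr(folded))
    if current:
        tokens.append("".join(current))
    return tokens


# "Iñtërnâtiônàlizætiøn", written with escapes so the code points are unambiguous
word = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"
print(tokenize_default(word))
# ['i', 't', 'rn', 'ti', 'n', 'liz', 'ti', 'n']
```

The distinct tokens (i, t, rn, ti, n, liz) are the same keywords that show up in the
[words] section of the extended-mode result.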

The simplest table addition that covers the whole Unicode range is U+80..U+2FFFF, and you
could use it for testing, but it has the following problems:
1) it includes both letters and non-letters,
2) it does *not* do any case folding.

Here's a table I once automatically built from Unicode character list. I have no idea
whether it's correct for any other languages but English and Russian. Also note that it
does not include anything but letters (no 0-9 numbers, no underscores, etc.)

charset_table = \
U+41..U+5a->U+61..U+7a, U+61..U+7a, \
U+aa, U+b5, U+ba, U+c0..U+d6->U+e0..U+f6, U+d8..U+de->U+f8..U+fe, U+df..U+f6, \
U+f8..U+ff, \
U+100..U+12f/2, U+130->U+69, U+131, U+132..U+137/2, U+138, U+139..U+148/2, \
U+149, U+14a..U+177/2, U+178->U+ff, U+179..U+17e/2, U+17f..U+180, U+181->U+253, \
U+182..U+185/2, U+186->U+254, U+187..U+188/2, U+189..U+18a->U+256..U+257, \
U+18b..U+18c/2, U+18d, U+18e->U+1dd, U+18f->U+259, U+190->U+25b, U+191..U+192/2, \
U+193->U+260, U+194->U+263, U+195, U+196->U+269, U+197->U+268, U+198..U+199/2, \
U+19a..U+19b, U+19c->U+26f, U+19d->U+272, U+19e, U+19f->U+275, U+1a0..U+1a5/2, \
U+1a6->U+280, U+1a7..U+1a8/2, U+1a9->U+283, U+1aa..U+1ab, U+1ac..U+1ad/2, \
U+1ae->U+288, U+1af..U+1b0/2, U+1b1..U+1b2->U+28a..U+28b, U+1b3..U+1b6/2, \
U+1b7->U+292, U+1b8..U+1b9/2, U+1ba..U+1bb, U+1bc..U+1bd/2, U+1be..U+1c3, \
U+1c4->U+1c6, U+1c5..U+1c6/2, U+1c7->U+1c9, U+1c8..U+1c9/2, U+1ca->U+1cc, \
U+1cb..U+1dc/2, U+1dd, U+1de..U+1ef/2, U+1f0, U+1f1->U+1f3, U+1f2..U+1f5/2, \
U+1f6->U+195, U+1f7->U+1bf, U+1f8..U+21f/2, U+220->U+19e, U+221, U+222..U+233/2, \
U+234..U+23a, U+23b..U+23c/2, U+23d->U+19a, U+23e..U+240, U+241->U+294, \
U+250..U+2c1, U+2c6..U+2d1, U+2e0..U+2e4, U+2ee, \
U+1d00..U+1dbf, U+1e00..U+1e95/2, U+1e96..U+1e9b, U+1ea0..U+1ef9/2, \
U+37a, U+386..U+389->U+3ac..U+3af, U+38c..U+38e->U+3cc..U+3ce, U+390, \
U+391..U+3a1->U+3b1..U+3c1, U+3a3..U+3ab->U+3c3..U+3cb, U+3ac..U+3ce, \
U+3d0..U+3d7, U+3d8..U+3ef/2, U+3f0..U+3f3, U+3f4->U+3b8, U+3f5, \
U+3f7..U+3f8/2, U+3f9->U+3f2, U+3fa..U+3fb/2, U+3fc..U+3ff, \
U+400..U+40f->U+450..U+45f, U+410..U+42f->U+430..U+44f, \
U+430..U+45f, U+460..U+481/2, U+48a..U+4bf/2, U+4c0, \
U+4c1..U+4ce/2, U+4d0..U+4f9/2, U+500..U+50f/2, \
U+531..U+556->U+561..U+586, U+559, U+561..U+587, \
U+5d0..U+5ea, U+5f0..U+5f2, \
U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, \
U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff, \
U+e01..U+e30, U+e32..U+e33, U+e40..U+e46, \
U+e81..U+e82, U+e84, U+e87..U+e88, U+e8a, U+e8d, U+e94..U+e97, U+e99..U+e9f, \
U+ea1..U+ea3, U+ea5, U+ea7, U+eaa..U+eab, U+ead..U+eb0, U+eb2..U+eb3, \
U+ebd, U+ec0..U+ec4, U+ec6, U+edc..U+edd, \
U+1000..U+1021, U+1023..U+1027, U+1029..U+102a, U+1050..U+1055, \
U+10a0..U+10c5->U+2d00..U+2d25, U+10d0..U+10fa, U+10fc, U+2d00..U+2d25, \
U+3005..U+3006, U+3031..U+3035, U+303b..U+303c, U+3041..U+3096, \
U+309d..U+309f, U+30a1..U+30fa, U+30fc..U+30ff, U+31f0..U+31ff

Here's the same table but in human-readable form. You could perhaps pull required pieces
from this one.

// latin classic
U+41..U+5a->U+61..U+7a, U+61..U+7a,

// latin accents
U+aa, U+b5, U+ba, U+c0..U+d6->U+e0..U+f6, U+d8..U+de->U+f8..U+fe, U+df..U+f6, U+f8..U+ff,

// more latin accents
U+100..U+12f/2, U+130->U+69, U+131, U+132..U+137/2, U+138, U+139..U+148/2,
U+149, U+14a..U+177/2, U+178->U+ff, U+179..U+17e/2, U+17f..U+180, U+181->U+253,
U+182..U+185/2, U+186->U+254, U+187..U+188/2, U+189..U+18a->U+256..U+257,
U+18b..U+18c/2, U+18d, U+18e->U+1dd, U+18f->U+259, U+190->U+25b, U+191..U+192/2,
U+193->U+260, U+194->U+263, U+195, U+196->U+269, U+197->U+268, U+198..U+199/2,
U+19a..U+19b, U+19c->U+26f, U+19d->U+272, U+19e, U+19f->U+275, U+1a0..U+1a5/2,
U+1a6->U+280, U+1a7..U+1a8/2, U+1a9->U+283, U+1aa..U+1ab, U+1ac..U+1ad/2,
U+1ae->U+288, U+1af..U+1b0/2, U+1b1..U+1b2->U+28a..U+28b, U+1b3..U+1b6/2,
U+1b7->U+292, U+1b8..U+1b9/2, U+1ba..U+1bb, U+1bc..U+1bd/2, U+1be..U+1c3,
U+1c4->U+1c6, U+1c5..U+1c6/2, U+1c7->U+1c9, U+1c8..U+1c9/2, U+1ca->U+1cc,
U+1cb..U+1dc/2, U+1dd, U+1de..U+1ef/2, U+1f0, U+1f1->U+1f3, U+1f2..U+1f5/2,
U+1f6->U+195, U+1f7->U+1bf, U+1f8..U+21f/2, U+220->U+19e, U+221, U+222..U+233/2,
U+234..U+23a, U+23b..U+23c/2, U+23d->U+19a, U+23e..U+240, U+241->U+294,
U+250..U+2c1, U+2c6..U+2d1, U+2e0..U+2e4, U+2ee,

// even more latin accents
U+1d00..U+1dbf, U+1e00..U+1e95/2, U+1e96..U+1e9b, U+1ea0..U+1ef9/2,


// greek
U+37a, U+386..U+389->U+3ac..U+3af, U+38c..U+38e->U+3cc..U+3ce, U+390,
U+391..U+3a1->U+3b1..U+3c1, U+3a3..U+3ab->U+3c3..U+3cb, U+3ac..U+3ce,
U+3d0..U+3d7, U+3d8..U+3ef/2, U+3f0..U+3f3, U+3f4->U+3b8, U+3f5,
U+3f7..U+3f8/2, U+3f9->U+3f2, U+3fa..U+3fb/2, U+3fc..U+3ff,


// cyrillic
U+400..U+40f->U+450..U+45f, U+410..U+42f->U+430..U+44f,
U+430..U+45f, U+460..U+481/2, U+48a..U+4bf/2, U+4c0,
U+4c1..U+4ce/2, U+4d0..U+4f9/2, U+500..U+50f/2,


// armenian
U+531..U+556->U+561..U+586, U+559, U+561..U+587,


// hebrew
U+5d0..U+5ea, U+5f0..U+5f2,


// arabic
U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5,
U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff,


// thai
U+e01..U+e30, U+e32..U+e33, U+e40..U+e46,


// lao
U+e81..U+e82, U+e84, U+e87..U+e88, U+e8a, U+e8d, U+e94..U+e97, U+e99..U+e9f,
U+ea1..U+ea3, U+ea5, U+ea7, U+eaa..U+eab, U+ead..U+eb0, U+eb2..U+eb3,
U+ebd, U+ec0..U+ec4, U+ec6, U+edc..U+edd,


// myanmar
U+1000..U+1021, U+1023..U+1027, U+1029..U+102a, U+1050..U+1055,


// georgian
U+10a0..U+10c5->U+2d00..U+2d25, U+10d0..U+10fa, U+10fc, U+2d00..U+2d25,


// katakana/hiragana
U+3005..U+3006, U+3031..U+3035, U+303b..U+303c, U+3041..U+3096,
U+309d..U+309f, U+30a1..U+30fa, U+30fc..U+30ff, U+31f0..U+31ff
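
For anyone pulling pieces out of the table, here is a rough Python sketch of how the
three rule shapes used in it expand into a character map: a plain range (U+61..U+7a),
a range-to-range mapping (U+41..U+5a->U+61..U+7a) and a "/2" checkerboard range
(U+100..U+12f/2), which, as I understand the Sphinx documentation, maps each pair of
characters to the second one of the pair. The functions expand_rule and expand_table are
hypothetical helpers written for this example. They only understand the U+XXXX notation,
do not check that source and destination ranges have the same length, and do not handle
the "//" comment lines.

```python
import re

# One rule: U+a, U+a..U+b, U+a->U+c, U+a..U+b->U+c..U+d, or U+a..U+b/2
RULE = re.compile(
    r"^U\+([0-9a-fA-F]+)(?:\.\.U\+([0-9a-fA-F]+))?"
    r"(?:->U\+([0-9a-fA-F]+)(?:\.\.U\+([0-9a-fA-F]+))?|(/2))?$"
)


def expand_rule(rule):
    """Expand a single rule into a {source code point: folded code point} dict."""
    m = RULE.match(rule.strip())
    if not m:
        raise ValueError(f"unsupported rule: {rule!r}")
    start = int(m.group(1), 16)
    end = int(m.group(2), 16) if m.group(2) else start
    if m.group(5):
        # Checkerboard: the first char of each pair maps to the second,
        # the second maps to itself.
        return {cp: start + ((cp - start) | 1) for cp in range(start, end + 1)}
    if m.group(3):
        # Mapping: shift every source char by the same offset into the target range.
        dst = int(m.group(3), 16)
        return {cp: dst + (cp - start) for cp in range(start, end + 1)}
    # Plain char or range: every char maps to itself.
    return {cp: cp for cp in range(start, end + 1)}


def expand_table(text):
    """Expand a comma-separated rule list; line-continuation backslashes are ignored."""
    table = {}
    for rule in text.replace("\\", " ").split(","):
        rule = rule.strip()
        if rule:
            table.update(expand_rule(rule))
    return table


sample = r"""
U+41..U+5a->U+61..U+7a, U+61..U+7a, \
U+100..U+12f/2, U+130->U+69
"""
table = expand_table(sample)
print(chr(table[ord("A")]))                  # a
print(hex(table[0x100]), hex(table[0x101]))  # 0x101 0x101
print(hex(table[0x130]))                     # 0x69
```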

Goodwill

Name: Goodwill
Posts: 5

to: shodan, 2007-12-18 00:51:38


Thank you very much for the conversion table. I played with it a bit today and it
started to work on my Linux install. The Windows install still needs some tweaks to
work correctly, probably because its MySQL charset is different, even though both
installs are utf8-based. Is there a suggested utf8 charset for the connection and the
database, or does it not matter which utf8 collation/connection charset I use? This part
isn't clear to me yet.

shodan

Name: Andrew Aksyonoff
Posts: 4117

to: Goodwill, 2007-12-20 02:22:57


> Is there any suggested utf8 charset for connection and database or it doesnt matter what
> kind of utf collation/connection charset I use?

The collation should not matter; the charset does. Normally you'd override per-connection
charset using

sql_query_pre = SET NAMES utf8

just to be on the safe side, and that's it.
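
As a rough illustration of why the charset matters, the Python sketch below encodes the
test word from this thread in two ways. Python's latin-1 codec is used here as a stand-in
for a non-utf8 connection charset (an assumption; MySQL's latin1 is not byte-for-byte
identical), and is_valid_utf8 is a helper made up for this example.

```python
# "Iñtërnâtiônàlizætiøn", written with escapes so the code points are unambiguous
word = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"

utf8_bytes = word.encode("utf-8")      # what an index with charset_type = utf-8 expects
latin1_bytes = word.encode("latin-1")  # what a single-byte connection charset would send


def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(word), len(utf8_bytes), len(latin1_bytes))          # 20 27 20
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))  # True False
```

The same 20 characters arrive as different byte sequences, and the single-byte form is
not valid UTF-8 at all.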

erelsgl

Name: Erel
Posts: 14

to: shodan, 2008-03-26 13:54:23


> Here's a table I once automatically built from Unicode character list. I have no idea
> whether it's correct for any other languages but English and Russian. Also note that it
> does not include anything but letters (no 0-9 numbers, no underscores, etc.)
>
> charset_table = \
> U+41...U+5a->U+61...U+7a, U+61...U+7a, \
> U+aa, U+b5, U+ba, U+c0...U+d6->U+e0...U+f6, U+d8...U+de->U+f8...U+fe, U+df...U+f6,
> U+f8...U+ff, \
> U+100...U+12f/2, U+130->U+69, U+131, U+132...U+137/2, U+138, U+139...U+148/2,

Note that ranges should be written with ".." and not "...".
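
If you have already pasted a table with three dots, a small Python sketch like this one
can rewrite it. The function name fix_range_dots is made up for this example, and the
pattern assumes every range endpoint is written in the U+XXXX form.

```python
import re


def fix_range_dots(table_text):
    """Replace runs of three or more dots between two U+XXXX endpoints with '..'."""
    return re.sub(r"(?<=[0-9a-fA-F])\.{3,}(?=U\+)", "..", table_text)


print(fix_range_dots("U+41...U+5a->U+61...U+7a, U+61...U+7a"))
# U+41..U+5a->U+61..U+7a, U+61..U+7a
```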

bfarber

Name: Brandon Farber
Posts: 11

to: shodan, 2009-01-30 04:15:56


> > I guess the problem may be in the charset table,
>
> Here's a table I once automatically built from Unicode character list. I have no idea
> whether it's correct for any other languages but English and Russian. Also note that it
> does not include anything but letters (no 0-9 numbers, no underscores, etc.)
>
> charset_table = \
> ....................................
> // greek
> U+37a, U+386..U+389->U+3ac..U+3af, U+38c..U+38e->U+3cc..U+3ce, U+390,
> U+391..U+3a1->U+3b1..U+3c1, U+3a3..U+3ab->U+3c3..U+3cb, U+3ac..U+3ce,
> U+3d0..U+3d7, U+3d8..U+3ef/2, U+3f0..U+3f3, U+3f4->U+3b8, U+3f5,
> U+3f7..U+3f8/2, U+3f9->U+3f2, U+3fa..U+3fb/2, U+3fc..U+3ff,

I'm working on a Greek site that we converted from iso-8859-7 (see this topic if you use
8859-7: http://sphinxsearch.com/forum/view.html?id=364 ) and had to fiddle with the
charset table quite a bit. A lot of the Greek characters are no longer used in the
language, but any with tonos or dialytika need to be folded into their non-accented
versions, and then of course there is case folding on top of that.

This is a long-winded version of the Greek table. You might be able to condense it using
the .. and /2 shortcuts, but it seems to work in my testing so far.

Might help others, so decided to post it.

                        U+370->U+371, U+371, \
                        U+372->U+373, U+373, \
                        U+374->U+375, U+375, \
                        U+376->U+377, U+377, \
                        U+37a, \
                        U+3fd->U+37b, U+3fe->U+37b, U+3ff->U+37b, U+37b, \
                        U+3fe->U+37c, U+37c, \
                        U+37e, \
                        U+386->U+3b1, \
                        U+388->U+3b5, \
                        U+389->U+3b7, \
                        U+38a->U+3b9, \
                        U+38c->U+3bf, \
                        U+38e->U+3c5, \
                        U+38f->U+3c9, \
                        U+390->U+3b9, \
                        U+3aa->U+3b9, \
                        U+3ab->U+3c5, \
                        U+3ac->U+3b1, \
                        U+3ad->U+3b5, \
                        U+3ae->U+3b7, \
                        U+3af->U+3b9, \
                        U+3b0->U+3c5, \
                        U+3ca->U+3b9, \
                        U+3cb->U+3c5, \
                        U+3cc->U+3bf, \
                        U+3cd->U+3c5, \
                        U+3ce->U+3c9, \
                        U+3cf->U+3d7, U+3d7, \
                        U+3d0->U+3b2, \
                        U+3d1->U+3b8, \
                        U+3d2->U+3c5, \
                        U+3d3->U+3c5, \
                        U+3d4->U+3c5, \
                        U+3d5->U+3c6, \
                        U+3d6->U+3c0, \
                        U+3d8->U+3d9, U+3d9, \
                        U+3da->U+3db, U+3db, \
                        U+3dc->U+3dd, U+3dd, \
                        U+3de->U+3df, U+3df, \
                        U+3e0->U+3d1, U+3d1, \
                        U+391..U+3a1->U+3b1..U+3c1, U+3b1..U+3c1, \
                        U+3a3..U+3a9->U+3c3..U+3c9, U+3c3..U+3c9, \
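
For comparison, the accent and case folding part of such a table could be sketched in
Python with the standard unicodedata module: decompose the character, drop the combining
marks (tonos, dialytika), then lowercase. The helpers fold and rule_for are hypothetical
names for this example. This only approximates the accented-letter entries above and does
not reproduce the other choices in the table, such as folding the symbol variants
U+3d0..U+3d6 into plain letters.

```python
import unicodedata


def fold(ch):
    """Strip combining marks such as tonos and dialytika, then lowercase."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


def rule_for(cp):
    """Return a charset_table entry for one code point, e.g. 'U+386->U+3b1'."""
    target = fold(chr(cp))
    if len(target) != 1:
        raise ValueError(f"U+{cp:x} does not fold to a single character")
    if ord(target) == cp:
        return f"U+{cp:x}"
    return f"U+{cp:x}->U+{ord(target):x}"


for cp in (0x386, 0x390, 0x3ab, 0x3ca, 0x3b1):
    print(rule_for(cp))
# U+386->U+3b1
# U+390->U+3b9
# U+3ab->U+3c5
# U+3ca->U+3b9
# U+3b1
```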



Copyright © Sphinx Technologies Inc, 2009