Full-text searching in Chinese documents
KENCHEN
Name: Ken Chen Posts: 13 |
2007-07-03 04:53:53
I tried full-text searching in Chinese documents a month ago, and it seems to work well.
Many threads in this forum discuss searching UTF-8 documents, but I must point out the
'min_infix_len' option.
With 'min_infix_len = 0', as in the default configuration, you will miss some results. To
get the most complete results, you should change it to 'min_infix_len = 1'.
However, that change produces too many unwanted results when the query keywords are in
English.
Does anyone have experience with Chinese full-text searching?
|
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: KENCHEN, 2007-07-03 10:56:14
> However, that change produces too many unwanted results when the query keywords are in
> English.
Try the 1-grams feature. See the ngram_len and ngram_chars options in sphinx.conf for that.
I have heard that it can yield pretty good results, especially when combined with some
query preprocessing.
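The post does not say what the query preprocessing would be. One possible reading, shown here only as a sketch, is to wrap each run of CJK characters in double quotes so that the single-character keywords produced by 1-grams must appear next to each other. This assumes the match mode in use supports quoted phrases; the function name quote_cjk_runs and the single character range are illustrative choices, not part of Sphinx.

```python
import re

# Assumption: only the main CJK ideograph block is handled here.
# A real setup should mirror whatever ranges are listed in ngram_chars.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def quote_cjk_runs(query):
    """Wrap each run of CJK characters in double quotes; leave other text as-is."""
    return CJK_RUN.sub(lambda m: '"' + m.group(0) + '"', query)


if __name__ == "__main__":
    # The mixed query keeps its English keyword untouched.
    print(quote_cjk_runs("\u4e2d\u6587 search"))
```

Whether this helps depends on the match mode and on how the index was built, so it is worth comparing result counts with and without the quoting.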
|
|
KENCHEN
Name: Ken Chen Posts: 13 |
to: shodan, 2007-07-03 12:59:24
> > However, that change produces too many unwanted results when the query keywords are in
> > English.
>
> Try the 1-grams feature. See the ngram_len and ngram_chars options in sphinx.conf for that.
> I have heard that it can yield pretty good results, especially when combined with some
> query preprocessing.
I have read the thread http://www.sphinxsearch.com/forum/view.html?id=209 and applied the
same configuration. But that configuration loses some results when searching.
|
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: KENCHEN, 2007-07-03 15:17:48
> But that configuration loses some results when searching.
Could you provide any examples?
|
 |
|
Nordic
Posts: 299 |
to: shodan, 2007-07-03 18:00:40
I've done further work on the charset ranges and n-gram to give better support for CJK:
min_word_len = 1
charset_type = utf-8
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,
U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,
U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,
U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,
U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,
U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,
U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,
U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,
U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,
U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,
U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,
U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,
U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,
U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,
U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,
U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,
U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,
U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,
U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39,
U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,
U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,
U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,
U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,
U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,
U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,
U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,
U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,
U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9, U+0E01..U+0E2E,
U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, U+A000..U+A48F, U+4E00..U+9FBF,
U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, U+2F800..U+2FA1F, U+2E80..U+2EFF,
U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,
U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,
U+A490..U+A4CF
ngram_len = 1
ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
|
 |
|
Nordic
Posts: 299 |
to: Nordic, 2007-07-03 18:02:20
Oh, and just to let you know the character ranges probably include ranges you do not
require, e.g. other Asian and Arabic scripts.
I do not have one tailored for CJK only.
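For anyone trimming the list down, a small helper can show which ranges actually cover the characters in their own data. This is a minimal sketch: it parses only plain entries such as U+4E00..U+9FBF or U+0E47 (not the "->" case mappings used in charset_table), and the shortened NGRAM_CHARS string is just a sample taken from the ranges above.

```python
import re

# Matches "U+XXXX" or "U+XXXX..U+YYYY"; mappings with "->" are not supported.
RANGE = re.compile(r"U\+([0-9A-Fa-f]+)(?:\.\.U\+([0-9A-Fa-f]+))?$")


def parse_ranges(spec):
    """Parse a comma-separated range list into (start, end) code point pairs."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        m = RANGE.match(item)
        if m is None:
            raise ValueError(f"unsupported entry: {item!r}")
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        ranges.append((start, end))
    return ranges


def covering_ranges(ch, ranges):
    """Return every (start, end) range that contains the character."""
    cp = ord(ch)
    return [(s, e) for s, e in ranges if s <= cp <= e]


# Sample subset of the ranges posted above.
NGRAM_CHARS = "U+4E00..U+9FBF, U+3400..U+4DBF, U+3040..U+309F, U+AC00..U+D7AF"
ranges = parse_ranges(NGRAM_CHARS)
```

Running covering_ranges over a sample of real documents shows which ranges are ever hit; the ones that never match are candidates for removal.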
|
 |
|
KENCHEN
Name: Ken Chen Posts: 13 |
to: shodan, 2007-07-04 02:58:38
> > But that configuration loses some results when searching.
>
> Could you provide any examples?
'min_infix_len = 0':
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff
index 'profile': query '«Óô ': returned 89 matches of 89 total in 0.000 sec
displaying matches:
1. document=29760, weight=2
( delete )
89. document=115680, weight=1
words:
1. '«Óô': 89 documents, 93 hits
'min_infix_len = 1':
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff
index 'profile': query '«Óô ': returned 1000 matches of 1229 total in 0.000 sec
displaying matches:
1. document=6, weight=3
2. document=2627, weight=3
( delete )
999. document=90077, weight=1
1000. document=90219, weight=1 (reach the limit)
words:
1. '«Óô': 1229 documents, 1633 hit
|
 |
|
KENCHEN
Name: Ken Chen Posts: 13 |
to: shodan, 2007-07-04 03:09:00
> > But that configuration loses some results when searching.
>
> Could you provide any examples?
In addition, I searched for the same keyword in MySQL, and there are 1918 results:
+----------+
| COUNT(*) |
+----------+
| 1918 |
+----------+
1 row in set (1.34 sec)
The real number of results should be more than what MySQL can find.
|
 |
|
shodan
Name: Andrew Aksyonoff Posts: 4117 |
to: KENCHEN, 2007-07-04 03:51:00
> 'min_infix_len = 0':
> index 'profile': query '«Óô ': returned 89 matches of 89 total in 0.000 sec
Are n-grams enabled?
> index 'profile': query '«Óô ': returned 1000 matches of 1229 total in 0.000 sec
>
> displaying matches:
> 1. document=6, weight=3
> 2. document=2627, weight=3
As I understand it, this is almost OK, but there are some matches which are found by MySQL
and not by Sphinx.
Please order the result sets by ID in both MySQL and Sphinx, identify some record which
is found by MySQL but not by Sphinx, dump that record and email it to me along with your
sphinx.conf file.
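Once both ID lists are exported, the comparison itself is a set difference. The sketch below assumes the IDs have already been collected into two Python lists; the function name and the sample IDs are made up for illustration. Note that the Sphinx list must be complete: the output quoted above was capped at 1000 of 1229 matches, and a capped list would make documents look missing when they are not.

```python
def missing_from_sphinx(mysql_ids, sphinx_ids):
    """Return IDs that MySQL matched but Sphinx did not, in ascending order."""
    return sorted(set(mysql_ids) - set(sphinx_ids))


if __name__ == "__main__":
    # Hypothetical ID lists, not taken from the real index.
    mysql_ids = [6, 2627, 29760, 3]
    sphinx_ids = [2627, 6]
    for doc_id in missing_from_sphinx(mysql_ids, sphinx_ids):
        print(doc_id)
```

Any ID printed here is a record worth dumping and sending along with sphinx.conf.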
|
 |
|
KENCHEN
Name: Ken Chen Posts: 13 |
to: shodan, 2007-07-10 15:11:27
> > 'min_infix_len = 0':
> > index 'profile': query '«Óô ': returned 89 matches of 89 total in 0.000 sec
>
> Are n-grams enabled?
>
> > index 'profile': query '«Óô ': returned 1000 matches of 1229 total in 0.000 sec
> >
> > displaying matches:
> > 1. document=6, weight=3
> > 2. document=2627, weight=3
>
> As I understand it, this is almost OK, but there are some matches which are found by MySQL
> and not by Sphinx.
>
> Please order the result sets by ID in both MySQL and Sphinx, identify some record which
> is found by MySQL but not by Sphinx, dump that record and email it to me along with your
> sphinx.conf file.
I apologize for the late reply.
The results are from a production server, so I will send you some fields and the other
necessary files.
Thanks in advance!
|
 |
|
marlboromoo
Name: moo Posts: 8 |
to: Nordic, 2007-10-01 09:38:11
> I've done further work on the charset ranges and n-gram to give better support for CJK:
Oh yeah!!
It works, thanks a lot :D
|