Wednesday, December 12, 2012

Better Traditional Chinese support in Lucene / Solr

    I found an approach that you may want to know about.
[ Search Engine Platform ]
    I used LucidWorks Enterprise for a quick proof of concept (POC) and the initial implementation.
[ Detail Configuration / Setup ]
1. Add a new field to store and index Traditional Chinese (TC) text.
2. Configure this field with Lucene's SmartChineseAnalyzer. You can refer to the instructions in the detailed configuration.
3. Add a char filter (solr.MappingCharFilterFactory) to the analyzer so that it translates characters and phrases from TC to SC (Simplified Chinese) using a customized mapping file.
4. Change the default defType (query parser) from lucid to edismax, because the first one supports English only.
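
To make step 3 more concrete, here is a minimal Python sketch of what a mapping char filter does before tokenization. It is only an illustration, not the actual Solr code or my real mapping file: the entries in MAPPING_TEXT and the names parse_mapping and tc_to_sc are made up for this example. It assumes the `"source" => "target"` line syntax that Solr mapping files use, and it picks the longest matching source at each position, which is how a phrase entry can win over single-character entries.

```python
import re

# Hypothetical excerpt of a TC-to-SC mapping file, written in the
# "source" => "target" syntax used by solr.MappingCharFilterFactory.
MAPPING_TEXT = '''
# single characters
"軟" => "软"
"體" => "体"
"網" => "网"
# whole phrases can be mapped too
"軟體" => "软件"
"網路" => "网络"
'''

RULE = re.compile(r'^"(.+)"\s*=>\s*"(.*)"$')


def parse_mapping(text):
    """Parse mapping lines into a dict, skipping blanks and # comments."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            rules[match.group(1)] = match.group(2)
    return rules


def tc_to_sc(text, rules):
    """Replace mapped sources in text, preferring the longest match."""
    longest = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule applies here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


RULES = parse_mapping(MAPPING_TEXT)
print(tc_to_sc("軟體 and 網路", RULES))  # prints: 软件 and 网络
```

In Solr itself the same conversion is declared in the field type's analyzer chain rather than written by hand; the sketch only shows why the quality of the mapping file matters for the segmentation that follows.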

    There isn't a dedicated or better Traditional Chinese (Big5 encoding; "TC" in this article) analyzer/tokenizer in Lucene/Solr, but there are some good Simplified Chinese (GB encoding; "SC" in this article) analyzers/tokenizers. In the end, I chose SmartChineseAnalyzer for two major features:
1. The algorithm: it uses a Hidden Markov Model, and another implementation of it (ICTCLASx) has been shown to achieve over 98% search accuracy.
2. It integrates well as a plug-in, without further work.

2 comments:

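  1. I had planned to use the same approach, converting TC to SC and then handing the text to SmartChineseAnalyzer, but I ran into some performance problems. That said, I don't know what kind of use case your Enterprise Search is being built for. (Maybe POI search combined with geo info)..

    Later I tried another approach: I compiled a dictionary and used mmseg4j to do the processing.

    In the end, because the documents to be processed come in Chinese, English, and Japanese, I went back to StandardAnalyzer plus some workarounds in the Solr client application (for example, adjusting the search ranking based on feedback such as tags, clicks, and ratings), and got fairly good search results.

    Anyway, thanks for sharing.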

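  2. Thank you, IT, for continuing to follow along and for sharing.

    I originally planned to use mmseg4j and only later switched to SmartChineseAnalyzer.

    First, after I configured mmseg4j, Solr would not start and kept throwing exceptions;

    Second, I am not sure whether mmseg4j's algorithm (mmseg) supports only English and GB2312. As for SmartChineseAnalyzer, it does not support Big5 (the commercial version of ICTCLAS seems to), because Big5 and GB2312 characters sit at different positions in the UTF-8 code table. That is why I currently have the extra TC-to-SC conversion;

    Third, another team at my company has already shown that the Hidden Markov Model is an effective algorithm, so if I wanted to use mmseg4j I would have to spend a lot of effort persuading people.

    As for performance, the current case does not deal with big-data-scale volumes, so the performance is still acceptable.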
