Taco: Better Traditional Chinese support in Lucene / Solr

Wednesday, December 12, 2012

Better Traditional Chinese support in Lucene / Solr

I found an approach that you may want to know about.
[ Search Engine Platform ]
I use LucidWorks Enterprise for a quick POC and the initial implementation.
[ Detailed Configuration / Setup ]
1. Add a new field to store/index TC.
2. Configure this field with Lucene's SmartChineseAnalyzer. See the instructions in the detailed configuration.
3. Add a char filter (solr.MappingCharFilterFactory) to the analyzer to translate characters and phrases from TC to SC with a customized mapping file.
4. Change the default defType (query parser) from lucid to edismax, because the first one (lucid) supports English only.
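
To make step 3 more concrete, here is a minimal Python sketch of what a TC-to-SC mapping step does before the text reaches the tokenizer. It is not the Solr implementation. It only imitates the greedy, longest-match replacement that a mapping char filter performs, and it prints entries in the "source" => "target" syntax used by Solr mapping files. The dictionary entries and the function names (apply_mapping, to_mapping_lines) are made up for illustration; a real mapping file would contain far more pairs.

```python
# Hypothetical sample entries; a real mapping file would hold thousands of pairs.
TC_TO_SC = {
    "體": "体",
    "軟": "软",
    "尋": "寻",
    "軟體": "软件",  # phrase-level entry: the vocabulary differs, not only the glyphs
}


def apply_mapping(text, mapping):
    """Replace substrings greedily: the longest key matching at each position wins."""
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # No entry matched here, so the character passes through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def to_mapping_lines(mapping):
    """Render entries in the "source" => "target" syntax used by Solr mapping files."""
    return ['"{}" => "{}"'.format(src, dst) for src, dst in mapping.items()]


if __name__ == "__main__":
    print(apply_mapping("繁體中文軟體搜尋", TC_TO_SC))
    print("\n".join(to_mapping_lines(TC_TO_SC)))
```

Because longer keys win, a phrase entry such as 軟體 takes priority over the single-character entries for 軟 and 體, which is why both characters and whole phrases can live in the same mapping file.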

There isn't a dedicated or better Traditional Chinese (Big5 encoding; called TC in this article) analyzer/tokenizer in Lucene/Solr, but there are some good Simplified Chinese (GB encoding; called SC in this article) analyzers/tokenizers. In the end, I chose SmartChineseAnalyzer for 2 major reasons:
1. The algorithm: it uses a Hidden Markov Model, and another implementation (ICTCLASx) has been shown to reach over 98% search accuracy.
2. It integrates well as a plug-in, without further work.

2 comments:

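被軟體開發耽誤的廚工, January 3, 2013, 11:40 PM
I had also considered converting TC to SC and then handing the text to SmartChineseAnalyzer, but I ran into some performance problems. That said, I don't know what kind of use case your Enterprise Search is built for (maybe POI search combined with geo info).

Later I tried another approach: I compiled a dictionary and used mmseg4j for segmentation.

In the end, because the documents I process come in Chinese, English, and Japanese, I went back to StandardAnalyzer plus some workarounds in the Solr client application (for example, tuning the search ranking based on feedback such as tags, clicks, and ratings), and got fairly good search results.

Anyway, thanks for sharing.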

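Tony Chang, Taco, 小慶, January 4, 2013, 1:46 AM
Thank you, IT大, for your continued interest and for sharing.

I originally planned to use mmseg4j and only later switched to SmartChineseAnalyzer.

First, after I configured mmseg4j, Solr would not start and kept throwing exceptions.

Second, I am not sure whether the algorithm behind mmseg4j (MMSEG) supports only English and GB2312. As for SmartChineseAnalyzer, it does not support Big5 (the commercial version of ICTCLAS seems to), because Big5 and GB2312 characters map to different code points in UTF-8. That is why I currently have the extra TC-to-SC conversion step.

Third, another team in our company has already shown that the Hidden Markov Model is an effective algorithm, so if I wanted to use mmseg4j I would have to spend a lot of effort persuading people.

As for performance, the current case does not involve big-data-scale volumes, so performance is still acceptable.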