Thinking Sphinx in Arabic/Unicode
While using Thinking Sphinx in one of my Rails projects, I needed to search Arabic content. Since Sphinx supports Unicode, I thought it would be easy. It was not, because Unicode support through Thinking Sphinx is poorly documented. So here is what to do to support Arabic (Unicode) search.
After reading a little of the Sphinx documentation, I learned that to support non-English languages I had to define a charset_table for Sphinx to use while indexing my data. After some research, I found a nice charset table for several languages. I went to the configuration file generated by Thinking Sphinx (config/development.sphinx.conf in the application root) and added an English/Arabic charset_table. I stopped searchd, reindexed, and restarted it. Then I tried to search in Arabic, with no luck! I noticed that my new configuration, including the charset_table, was gone! Why? Thinking Sphinx regenerates the configuration file before reindexing!
After a lot of research, I discovered that to add custom configuration you must create the file config/sphinx.yml, which Thinking Sphinx uses to override its default configuration. Hey, why didn't anyone tell me that?!
After 2 hours of YAML syntax errors, I did it. Here is my sphinx.yml:
development: &my_settings
  enable_star: true
  min_prefix_len: 0
  min_infix_len: 1
  min_word_len: 1
  charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"
test:
  <<: *my_settings
production:
  <<: *my_settings
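Since YAML indentation and anchors are easy to get wrong, it can help to sanity-check the file from plain Ruby before reindexing. The following is only a sketch: the helpers `load_settings`, `code_point`, `accepted_ranges` and `covered?` are hypothetical names of mine, not part of Thinking Sphinx, and the charset parsing only handles the simple forms used in the table above (single characters, `U+` code points, `..` ranges and `->` mappings).

```ruby
require "yaml"

# The sphinx.yml from above, inlined so the sketch is self-contained.
SPHINX_YML = <<~YAML
  development: &my_settings
    enable_star: true
    min_prefix_len: 0
    min_infix_len: 1
    min_word_len: 1
    charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"
  test:
    <<: *my_settings
  production:
    <<: *my_settings
YAML

# Newer Psych versions reject aliases in YAML.load, so use unsafe_load when available.
def load_settings(text)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
end

# Hypothetical helper: turn one token ("a" or "U+621") into a code point.
def code_point(token)
  token =~ /\AU\+(\h+)\z/i ? Regexp.last_match(1).hex : token.ord
end

# Hypothetical helper: expand a charset_table string into ranges of accepted code points.
def accepted_ranges(charset_table)
  charset_table.split(",").map do |entry|
    source = entry.strip.split("->").first # ignore the mapping target
    first, last = source.split("..")
    code_point(first)..code_point(last || first)
  end
end

# True when every character of the word falls inside one of the ranges.
def covered?(word, ranges)
  word.each_codepoint.all? { |cp| ranges.any? { |range| range.cover?(cp) } }
end

settings = load_settings(SPHINX_YML)
ranges = accepted_ranges(settings["development"]["charset_table"])

puts settings["production"]["min_infix_len"] # 1, inherited through the anchor
puts covered?("مرحبا", ranges)               # true
puts covered?("\u0663", ranges)              # false: Arabic-Indic digits are not in the table
```

If a word you expect to find is reported as not covered, the indexer will most likely treat the missing characters as separators, so the charset_table is the first place to look.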
Other Settings
- min_word_len: 1
Setting the minimum indexed word length to 1 means index everything.
- min_prefix_len: 0
Setting the minimum word prefix length to index to 0 disables prefix indexing. If set to a positive number, the indexer will index all the possible keyword prefixes (i.e. word beginnings) in addition to the keywords themselves.
- min_infix_len: 1
Setting the minimum infix length to index to 1 asks the indexer to index all the possible keyword infixes (i.e. substrings) in addition to the keywords themselves. This allows wildcard searching with 'start*', '*end', and '*middle*' patterns. However, indexing infixes will make the index grow significantly (because of many more indexed keywords) and will degrade both indexing and searching times. Note that you can't enable both prefix and infix indexing at the same time, which is why I disabled prefix indexing.
- enable_star: true
This enables "star-syntax", or wildcard syntax, when searching through indexes which were created with prefix or infix indexing enabled. It only affects searching, so it can be changed without reindexing by simply restarting searchd.
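To get a feel for why infix indexing makes the index grow, here is a small illustration in plain Ruby. It does not reflect how Sphinx builds its index internally; the `prefixes` and `infixes` helpers are hypothetical and simply enumerate the extra keywords that a given minimum length would make searchable for a single word.

```ruby
# Illustration only: enumerate the keyword variants a minimum length implies.

# All prefixes (word beginnings) of at least min_len characters.
def prefixes(word, min_len)
  return [] if min_len < 1 # 0 means prefix indexing is disabled
  (min_len..word.length).map { |len| word[0, len] }
end

# All distinct infixes (substrings) of at least min_len characters.
def infixes(word, min_len)
  return [] if min_len < 1 # 0 means infix indexing is disabled
  (0...word.length).flat_map do |start|
    (min_len..(word.length - start)).map { |len| word[start, len] }
  end.uniq
end

puts prefixes("sphinx", 3).inspect # ["sph", "sphi", "sphin", "sphinx"]
puts infixes("sphinx", 3).length   # 10
puts infixes("sphinx", 1).length   # 21
puts infixes("كتاب", 1).length     # 10
```

A six-letter word already turns into 21 keywords with min_infix_len set to 1, so raising the value is a reasonable trade-off if one-character wildcard matches are not needed.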
Now, stop, reindex and restart searchd:
rake thinking_sphinx:stop
rake thinking_sphinx:index
rake thinking_sphinx:start
Finally, for the wildcard search to work, your controller should look something like this:
class PostsController < BaseController
  def search
    @posts = Post.search "*#{params[:search_query]}*"
  end
end
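One thing to watch with the interpolation above is a missing or multi-word query: a blank parameter becomes "**", and only the outer ends of a phrase get a star. A possible refinement, sketched here under my own assumptions, is a small helper that wraps each term separately. `wildcard_query` is a hypothetical name, not a Thinking Sphinx method, and whether per-term wildcards are what you want depends on your search requirements.

```ruby
# Hypothetical helper: wrap every search term in stars, dropping any stars the
# user typed, and return nil when there is nothing left to search for.
def wildcard_query(raw)
  terms = raw.to_s.split.map { |term| term.delete("*") }.reject(&:empty?)
  return nil if terms.empty?
  terms.map { |term| "*#{term}*" }.join(" ")
end

puts wildcard_query("كتاب جديد")    # *كتاب* *جديد*
puts wildcard_query("  ").inspect   # nil

# In the controller this could be used roughly as:
#   query  = wildcard_query(params[:search_query])
#   @posts = query ? Post.search(query) : []
```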
You should be enjoying Arabic search now.
Written By:
Hatem Mahmoud (www.expressionlab.com)

