After two years, my improvement for the Lucene Java search library has been committed... See LUCENE-1287 for details.
Lucene
Committed! (LUCENE-1166) A tokenfilter to decompose compound words
Submitted by Thomas on Fri, 2008-05-16 13:57
LUCENE-1166 (A tokenfilter to decompose compound words) has been committed today. Hooray...
Decomposition of German compound words for effective searching (3)
Submitted by Thomas on Fri, 2008-04-18 16:17
I have created a token filter for Apache Lucene that works pretty well. You can find the code here: http://issues.apache.org/jira/browse/LUCENE-1166.
It works as I described in the earlier blog entries.
I also created a Swedish hyphenation grammar. It is attached to this blog entry.
Update [2008-03-09]:
Decomposition of German compound words for effective searching (2)
Submitted by Thomas on Sat, 2008-01-26 18:15
I have done a few small experiments with the decomposition algorithm I suggested earlier in this blog (you find it here). For hyphenation I used the hyphenator of the Apache FOP project. I got the dictionary from this page: http://wiki.services.openoffice.org/wiki/Dictionaries.
Decomposition of German compound words for effective searching
Submitted by Thomas on Sat, 2008-01-12 18:01
Compound words like Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter 'Rhine steamship-company vice-captain' are very common in the German language (the same is true for other Germanic languages such as Swedish).
Most full-text indexing solutions on the market only index words delimited by whitespace (or by other non-word characters such as the punctuation marks .,;"). So you would never find sub-words like Kapitän unless you decompose such long words before indexing.
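To make the problem concrete, here is a minimal sketch in plain Java of what "decomposing before indexing" could look like. It is a brute-force dictionary lookup, not necessarily what the token filter in LUCENE-1166 does, and it uses no Lucene classes. The class name CompoundSplitter, the method subwords, the minimum length of 4 and the tiny word list are all assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class CompoundSplitter {

    // Returns every substring of `token` that is at least `minLen`
    // characters long and appears in `dictionary`, ordered by start position.
    // The dictionary is assumed to contain lower-case entries.
    static List<String> subwords(String token, Set<String> dictionary, int minLen) {
        String lower = token.toLowerCase(Locale.GERMAN);
        List<String> found = new ArrayList<>();
        for (int start = 0; start + minLen <= lower.length(); start++) {
            for (int end = start + minLen; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Toy dictionary (an assumption); a real one would be far larger.
        Set<String> dictionary = Set.of(
                "rhein", "dampf", "schiff", "fahrt",
                "gesellschaft", "kapitän", "stellvertreter");
        String word = "Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter";
        // Each sub-word found could be indexed alongside the original token.
        System.out.println(subwords(word, dictionary, 4));
    }
}
```

Checking every substring is quadratic in the word length, which is acceptable for single tokens but wasteful. Restricting the candidate split points, for example to hyphenation points, would reduce the number of dictionary lookups.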


