Lucene

LUCENE-1287 has been committed after two years

After two years, my improvement for the Lucene Java search library has been committed. See LUCENE-1287 for details.

Committed! (LUCENE-1166) A tokenfilter to decompose compound words

LUCENE-1166 (A tokenfilter to decompose compound words) has been committed today. Hooray...

Decomposition of German compound words for effective searching (3)

I have created a token filter for Apache Lucene that works pretty well. You can find the code here: http://issues.apache.org/jira/browse/LUCENE-1166.

It works like I described in the older blog entries.

I also created a Swedish hyphenation grammar. It is attached to the blog entry.

Update [2008-03-09]:

Decomposition of German compound words for effective searching (2)

I have run a few small experiments with the decomposition algorithm I suggested earlier in this blog (you can find it here). For hyphenation I used the hyphenator from the Apache FOP project. I got the dictionary from this page: http://wiki.services.openoffice.org/wiki/Dictionaries.
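To give a rough idea of how hyphenation and a dictionary can work together, here is a minimal sketch in Python. It is not the code from the experiments and it does not call the FOP hyphenator: the break points are written by hand and only stand in for what a hyphenator might return, and the function name and the tiny dictionary are made up for illustration. The idea is to treat the hyphenation points as the only allowed cut positions and to keep every piece between two cut positions that is a dictionary word.

```python
from itertools import combinations


def decompose_at_breaks(word, break_points, dictionary):
    """Return dictionary words that start and end at a hyphenation point.

    break_points are character offsets inside the word; in a real setup
    they would come from a hyphenator (assumption: given by hand here).
    """
    lower = word.lower()
    # The word start and word end are always valid cut positions.
    boundaries = [0] + sorted(break_points) + [len(lower)]
    parts = []
    for start, end in combinations(boundaries, 2):
        candidate = lower[start:end]
        if candidate in dictionary and candidate not in parts:
            parts.append(candidate)
    return parts


# Hypothetical toy dictionary and hand-written break points
# for Dampf-schiff-fahrt.
TOY_DICTIONARY = {"dampf", "schiff", "fahrt", "schifffahrt"}
parts = decompose_at_breaks("Dampfschifffahrt", [5, 11], TOY_DICTIONARY)
print(parts)
```

Restricting the cuts to hyphenation points keeps the number of candidates small compared with testing every possible substring. The sketch ignores linking letters such as the s in Schifffahrtsgesellschaft, so a part like Fahrt would be missed there.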

Decomposition of German compound words for effective searching

Compound words like Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter 'Rhine steamship-company vice-captain' are very common in German (the same is true for other Germanic languages such as Swedish).
Most full-text indexing solutions on the market can only index words delimited by whitespace or by other non-word characters such as the punctuation marks .,;". So you will never find sub-words like Kapitän unless you decompose such long words before indexing.
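The problem and the simplest possible remedy can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the algorithm from this blog series: the function name, the minimum sub-word length and the toy dictionary are invented for the example. It checks every substring of the word against a dictionary and returns the hits, which could then be indexed next to the original token.

```python
def decompose(word, dictionary, min_len=4):
    """Return every dictionary entry that occurs as a substring of word."""
    lower = word.lower()
    found = []
    for start in range(len(lower)):
        # Only consider substrings of at least min_len characters.
        for end in range(start + min_len, len(lower) + 1):
            part = lower[start:end]
            if part in dictionary and part not in found:
                found.append(part)
    return found


# Hypothetical toy dictionary; a real one would hold many thousands of words.
DICTIONARY = {
    "rhein", "dampf", "schiff", "fahrt",
    "gesellschaft", "kapitän", "stellvertreter",
}

subwords = decompose(
    "Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter", DICTIONARY
)
print(subwords)
```

With these sub-words in the index, a search for Kapitän would match the long compound. Testing all substrings is quadratic in the word length, which is acceptable for single tokens but is also the reason to look for smarter ways to pick split positions.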
