Decomposition of German compound words for effective searching (3)

I have created a token filter for Apache Lucene that works pretty well. You can find the code here: http://issues.apache.org/jira/browse/LUCENE-1166.

It works as described in the earlier blog entries.

I also created a Swedish hyphenation grammar. It is attached to this blog entry.

Update [2008-03-09]:
I have posted the Swedish grammar to the OFFO project for inclusion in the next release. The patch is available here: http://sourceforge.net/tracker/index.php?func=detail&aid=1906166&group_id=116740&atid=678288

Update [2008-04-18]:
I am experimenting with replacing the HashMap dictionary lookup with a Lucene index lookup. So far I have no measurements that show a speedup for the dumb compound word token filter.
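
For readers who have not seen the earlier entries, here is a minimal sketch of what a hash-based dictionary lookup for compound decomposition can look like. It is not the code from LUCENE-1166. It assumes that the "dumb" approach means trying every substring of a token against the dictionary, and the class name DumbDecompounder, its methods and its size parameters are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene.
class DumbDecompounder {
    private final Set<String> dictionary; // lower-cased dictionary words
    private final int minSubwordSize;     // shortest substring to try
    private final int maxSubwordSize;     // longest substring to try

    DumbDecompounder(Set<String> dictionary, int minSubwordSize, int maxSubwordSize) {
        this.dictionary = dictionary;
        this.minSubwordSize = minSubwordSize;
        this.maxSubwordSize = maxSubwordSize;
    }

    // Returns every dictionary word that occurs as a substring of the token,
    // ordered by start position and then by length.
    List<String> decompose(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            int maxLen = Math.min(maxSubwordSize, lower.length() - start);
            for (int len = minSubwordSize; len <= maxLen; len++) {
                String candidate = lower.substring(start, start + len);
                // This is the hash lookup that an index lookup would replace.
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        DumbDecompounder decompounder = new DumbDecompounder(dict, 2, 15);
        // prints [donau, dampf, schiff]
        System.out.println(decompounder.decompose("Donaudampfschiff"));
    }
}
```

In this sketch the number of lookups per token grows with the token length times the range of allowed subword sizes, so the cost of a single lookup is what any replacement for the hash lookup would have to beat.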

Attachment: se.xml (31.49 KB)
