Thomas's blog

Decomposition of German compound words for effective searching (3)

I have created a token filter for Apache Lucene that works pretty well. You can find the code here:

It works like I described in the older blog entries.

I also created a Swedish hyphenation grammar. It is attached to the blog entry.

Update [2008-03-09]:

Memory profiling a webMethods Integration Server

I am currently trying to find a memory leak in a webMethods Integration Server (6.5SP2).

This is a nightmare. I have now tried several memory profilers (YourKit Profiler, JProfiler, JProbe, ...) and all of them crash the JVM when you try to take a memory snapshot. I don't know yet why they crash the JVM. I assume it is because webMethods uses native libraries for some old legacy stuff.

With JProfiler I am able to look at the heap at least partially. We will see...

ApacheCon Europe '08 in Amsterdam

I have the opportunity to attend the ApacheCon Europe '08 conference in Amsterdam. I arrived today, and what I have seen of Amsterdam so far (the airport and the central railway station area) reminds me of Stockholm, Sweden.

The conference part starts tomorrow morning (Monday and Tuesday were training days). I have my digital camera with me and will add some photos tomorrow.

Here is an image from the conference. More pictures are in the gallery.

Syncing the Ogo with Linux - Progress!

Mirko has started a project with what he has achieved so far. Really nice work, Mirko! The syncing is working, with some problems Mirko still has to iron out (all the current problems are listed in the README file).

Keep up the good work!

Calculator for Linux Huge Page Table Config for Java JVMs

You can speed up Java processes that handle a big heap (several gigabytes) by configuring Linux and the JVM to use large memory pages. This minimizes the load on the processor's TLB (translation lookaside buffer), which can hold only a limited number of entries. For applications that don't have localized memory structures you can easily see a speedup of 50-100%, because large pages reduce the thrashing of the TLB. But the speedup depends on the type of application you are running, of course.
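The arithmetic behind such a calculator is small enough to sketch. The following is only an illustration, not the calculator from this post: it assumes a huge page size of 2 MiB (check the `Hugepagesize` line in `/proc/meminfo` on your machine), and the function names and the 512 MiB overhead allowance for non-heap JVM memory are made up for this example. It derives a value for the `vm.nr_hugepages` sysctl and pairs it with the HotSpot flag `-XX:+UseLargePages`.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def huge_pages_needed(heap_bytes, overhead_bytes=0, page_bytes=2 * MIB):
    """Return how many huge pages cover heap_bytes plus overhead_bytes."""
    if heap_bytes <= 0 or overhead_bytes < 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive (overhead may be zero)")
    total = heap_bytes + overhead_bytes
    # Ceiling division: a partially used page still has to be reserved.
    return -(-total // page_bytes)


def suggested_settings(heap_gib, overhead_mib=512, page_bytes=2 * MIB):
    """Hypothetical helper: settings for a JVM with a fixed heap of heap_gib GiB.

    overhead_mib is an assumed allowance for non-heap JVM memory,
    not a measured value.
    """
    pages = huge_pages_needed(heap_gib * GIB, overhead_mib * MIB, page_bytes)
    return {
        "sysctl": f"vm.nr_hugepages = {pages}",
        "jvm_flags": f"-Xms{heap_gib}g -Xmx{heap_gib}g -XX:+UseLargePages",
    }


if __name__ == "__main__":
    for key, value in suggested_settings(4).items():
        print(f"{key}: {value}")
```

Depending on the kernel and JVM version, further settings (for example shared memory limits or the group allowed to use huge pages) may be needed as well; the sketch only covers the page count.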

Solr and HTTP caches

Now that SOLR-127 (Make Solr more friendly to external HTTP caches) is committed to the Solr codebase, here are some thoughts about how to set good values for the "Cache-Control" HTTP header for Solr. See about how to set the values.

When I talk about shared caches this also means browser caches.

Rules of thumb:

Decomposition of German compound words for effective searching (2)

I have done a few small experiments with the decomposition algorithm I suggested earlier in this blog (you find it here). For hyphenation I used the hyphenator of the Apache FOP project. The dictionary I got from this page: for mobile devices is now available in a mobile device friendly format as well. Point your mobile device's browser to

Decomposition of German compound words for effective searching

Compound words like Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter 'Rhine steamship-company vice-captain' are very common in the German language (the same is true for other Germanic languages such as Swedish).
Most full-text indexing solutions on the market only index words delimited by whitespace (or by other characters that are not part of words, such as the punctuation symbols .,;"). So you would never find sub-words like Kapitän unless you decompose such long words before indexing.
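To make the idea concrete, here is a minimal sketch of a purely dictionary-based decomposition. It is not the algorithm from this blog series (which also uses a hyphenation grammar), just a simplified stand-in: the function name, the tiny word list, and the choice of "s" as the only linking element are assumptions made for illustration.

```python
def decompose(word, dictionary, linkers=("", "s"), min_len=3):
    """Split word into dictionary words, or return None if that fails.

    linkers are optional linking elements (such as the German "s")
    that may sit between two parts and are dropped from the result.
    """
    word = word.lower()
    if word in dictionary:
        return [word]
    # Try the longest possible first part first.
    for end in range(len(word) - min_len, min_len - 1, -1):
        head = word[:end]
        if head not in dictionary:
            continue
        rest = word[end:]
        for linker in linkers:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if len(tail) < min_len:
                continue
            parts = decompose(tail, dictionary, linkers, min_len)
            if parts is not None:
                return [head] + parts
    return None


if __name__ == "__main__":
    # Toy dictionary, an assumption for this example only.
    words = {"rhein", "dampf", "schiff", "fahrt",
             "gesellschaft", "kapitän", "stellvertreter"}
    print(decompose(
        "Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter", words))
```

With the parts indexed as separate tokens next to the full word, a search for Kapitän can match the compound. A real implementation needs a much larger dictionary and more care with linking elements and ambiguous splits.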
