Decomposition of German compound words for effective searching

Compound words like Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter 'Rhine steamship-company vice-captain' are very common in German (the same is true for other Germanic languages such as Swedish).
Most full-text indexing solutions on the market only index words delimited by whitespace or by other non-word characters such as the punctuation marks .,;". So a search for a sub-word like Kapitän will never match unless such long words are decomposed before indexing.
As a human you naturally see how to decompose these words (although sometimes this is hard for humans, too). But how should an algorithm achieve this?

Algorithms are good at analyzing syntax, but when it comes to semantics they are lost. Companies such as Basis Technology sell such algorithms (I don't know how much they cost, so don't ask). In the OSS world I have not found such a technology so far.

Here is a proposal for how such an algorithm could work:

  1. Decompose the word into syllables. We do that only for words that are longer than 7 characters. For the above example you would get: Rhein-dampf-schiff-fahrts-ge-sell-schafts-ka-pi-täns-stell-ver-tre-ter
  2. Recursively combine syllables and check the generated tokens against a dictionary. This dictionary can be created automatically from existing content (Wikipedia, for example), as it only needs to state that a word exists.
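
The two steps above can be sketched in Python as follows. This is only an illustration under several assumptions: the syllables are supplied by hand (a real syllabifier is not shown), the dictionary is built from a one-sentence sample instead of a large corpus, and the function names build_dictionary, split_syllables and decompose are made up for this example. Trying the longest combination first is also just one possible choice.

```python
import re


def build_dictionary(text):
    """Collect every word of the text in lower case.

    The dictionary only records that a word exists, so a set is enough.
    """
    return {word.lower() for word in re.findall(r"[^\W\d_]+", text)}


def split_syllables(syllables, dictionary):
    """Recursively join leading syllables into a known word, then split the rest.

    Returns a list of dictionary words covering all syllables,
    or None if no such split exists.
    """
    if not syllables:
        return []
    # Assumption: prefer the longest combination that is a known word.
    for end in range(len(syllables), 0, -1):
        candidate = "".join(syllables[:end]).lower()
        if candidate in dictionary:
            rest = split_syllables(syllables[end:], dictionary)
            if rest is not None:
                return [candidate] + rest
    return None


def decompose(syllables, dictionary, max_plain_length=7):
    """Decompose only words longer than max_plain_length characters.

    Falls back to the whole word when no split is found.
    """
    word = "".join(syllables).lower()
    if len(word) <= max_plain_length:
        return [word]
    return split_syllables(syllables, dictionary) or [word]


# Hypothetical sample content standing in for a large corpus.
sample_text = "Der Kapitän steuert das Schiff mit Dampf."
dictionary = build_dictionary(sample_text)

# Syllables given by hand for the made-up compound "Dampfschiffkapitän".
parts = decompose(["Dampf", "schiff", "ka", "pi", "tän"], dictionary)
print(parts)
```

With this sample dictionary the compound is split into dampf, schiff and kapitän, so a search for Kapitän could match once the parts are indexed alongside the full word.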

Optimizations:

  • Remove the genitive 's' from the syllables: Ka-pi-täns -> Ka-pi-tän
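
A minimal sketch of this optimization, assuming the dictionary is a set of lower-case words. The helper name lookup is hypothetical; it would replace a plain membership test when checking a combined token against the dictionary.

```python
def lookup(candidate, dictionary):
    """Return the dictionary form of candidate, or None if it is unknown.

    If the token itself is not a known word but ends in 's', try again
    without that trailing 's' (Kapitäns -> Kapitän).
    """
    candidate = candidate.lower()
    if candidate in dictionary:
        return candidate
    if candidate.endswith("s") and candidate[:-1] in dictionary:
        return candidate[:-1]
    return None


# Hypothetical mini-dictionary for illustration.
words = {"kapitän", "haus"}
print(lookup("Kapitäns", words))
print(lookup("Haus", words))
```

The exact match is checked first so that words which legitimately end in 's', such as Haus, are not shortened.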

Other examples of really long compound words:

  • Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
  • Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
