Compound words like Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter ('Rhine steamship-company vice-captain') are very common in German (the same is true of other Germanic languages such as Swedish).
Most full-text indexing solutions on the market only index words delimited by whitespace or by other non-word characters such as punctuation marks (.,;"). So you will never find a sub-word like Kapitän unless you decompose such long words before indexing.
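To make the problem concrete, here is a minimal sketch of a tokenizer that splits on whitespace and punctuation, feeding a toy inverted index. The names `tokenize` and `build_index` are made up for this illustration and do not correspond to any particular indexing product.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Split on anything that is not a word character; lowercase for lookup.
    return [t.lower() for t in re.split(r"\W+", text) if t]

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {1: "Der Rheindampfschifffahrtsgesellschaftskapitänsstellvertreter kommt."}
index = build_index(docs)

# The sub-word was never indexed on its own, so this lookup finds nothing.
print(index.get("kapitän", set()))
# Only the complete compound is searchable.
print(index.get("rheindampfschifffahrtsgesellschaftskapitänsstellvertreter", set()))
```

The compound is stored as a single token, so a search for Kapitän can only succeed if the indexer emits the sub-words as additional tokens.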
As a human you naturally see how to decompose the words (sometimes this is hard for humans, too). But how should an algorithm achieve this?
Algorithms are good at analyzing syntax, but when it comes to semantics they are lost. There are companies such as Basis Technology that sell such algorithms (I don't know how much they cost, so don't ask). In the OSS world I have not found such a technology so far.
Here is a proposal for how such an algorithm could work:
- Decompose the word into syllables. We do that only for words that are longer than 7 characters. For the above example you would get: Rhein-dampf-schiff-fahrts-ge-sell-schafts-ka-pi-täns-stell-ver-tre-ter
- Recursively combine syllables and check the generated tokens against a dictionary. This dictionary can be created automatically from existing content (Wikipedia, for example), as it only needs to state that a word exists.
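The dictionary from the last step could be built roughly as follows. This is only a sketch: `build_dictionary` and its `min_count` threshold are assumptions for illustration, and real corpus text (a Wikipedia dump, say) would need its markup stripped first.

```python
import re
from collections import Counter

def build_dictionary(texts, min_count=1):
    # Count every run of letters across the corpus (Unicode-aware, so umlauts work).
    counts = Counter()
    for text in texts:
        counts.update(w.lower() for w in re.findall(r"[^\W\d_]+", text))
    # Only existence matters, so the result is a plain set of known words.
    return {w for w, c in counts.items() if c >= min_count}

corpus = [
    "Der Kapitän steuert das Schiff auf dem Rhein.",
    "Die Gesellschaft sucht einen Stellvertreter für den Kapitän.",
]
dictionary = build_dictionary(corpus)
print("kapitän" in dictionary)
```

Raising `min_count` is one possible way to keep typos and one-off tokens out of the dictionary, at the cost of missing rare but valid words.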
Optimizations:
- Remove the genitive 's' from the syllables: Ka-pi-täns -> Ka-pi-tän
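The proposal, including the genitive 's' optimization, can be sketched as below. Syllabification itself is not shown: the sketch assumes the word already arrives hyphenated into syllables, and the dictionary is a small hand-written set. The function names `lookup`, `decompose` and `index_terms` are hypothetical.

```python
def lookup(token, dictionary):
    # Return the dictionary form of token, or None if it is unknown.
    if token in dictionary:
        return token
    # Optimization: drop a trailing genitive 's' and try again.
    if token.endswith("s") and token[:-1] in dictionary:
        return token[:-1]
    return None

def decompose(syllables, dictionary):
    # Recursively combine leading syllables into a known word, then
    # decompose the remainder. Returns None if no complete split exists.
    if not syllables:
        return []
    for i in range(1, len(syllables) + 1):
        word = lookup("".join(syllables[:i]), dictionary)
        if word is None:
            continue
        rest = decompose(syllables[i:], dictionary)
        if rest is not None:
            return [word] + rest
    return None

def index_terms(hyphenated, dictionary, min_length=8):
    # Only words longer than 7 characters are decomposed.
    syllables = hyphenated.lower().split("-")
    word = "".join(syllables)
    if len(word) < min_length:
        return [word]
    parts = decompose(syllables, dictionary)
    if parts and parts != [word]:
        return [word] + parts
    return [word]

dictionary = {"rhein", "dampf", "schiff", "fahrt", "gesellschaft",
              "kapitän", "stellvertreter"}
compound = "Rhein-dampf-schiff-fahrts-ge-sell-schafts-ka-pi-täns-stell-ver-tre-ter"
print(index_terms(compound, dictionary))
```

With this toy dictionary the compound is indexed together with rhein, dampf, schiff, fahrt, gesellschaft, kapitän and stellvertreter. The sketch tries the shortest prefix first and backtracks when the remainder cannot be split; without memoization the worst case grows exponentially with the number of syllables, which should be acceptable for single words but is worth keeping in mind.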
Other examples of really long compound words:
- Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
- Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft


