# German special characters are replaced:
häufig;haufig
üor;uor
björk;bjork

# here the stemmer works okay, it maps related words to the same stem:
abschließen;abschliess
abschließender;abschliess
abschließendes;abschliess
abschließenden;abschliess

Tisch;tisch
Tische;tisch
Tischen;tisch
geheimtür;geheimtur

Haus;hau
Hauses;hau
Häuser;hau
Häusern;hau
# here's a case where overstemming occurs, i.e. a word is
# mapped to the same stem as unrelated words:
hauen;hau

# here's a case where understemming occurs, i.e. two related words
# are not mapped to the same stem. This is the case with basically
# all irregular forms:
Drama;drama
Dramen;dram

# replace "ß" with 'ss':
Ausmaß;ausmass

# fake words to test if suffixes are cut off:
xxxxxe;xxxxx
xxxxxs;xxxxx
xxxxxn;xxxxx
xxxxxt;xxxxx
xxxxxem;xxxxx
xxxxxer;xxxxx
xxxxxnd;xxxxx
# the suffixes are also removed when combined:
xxxxxetende;xxxxx

# words that are shorter than four characters are not changed:
xxe;xxe
# -em and -er are not removed from words shorter than five characters:
xxem;xxem
xxer;xxer
# -nd is not removed from words shorter than six characters:
xxxnd;xxxnd