Inbook,

Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words

Springer Berlin / Heidelberg, (2006)
DOI: 10.1007/11880592_25

Abstract

We present a new, fully unsupervised, human-intervention-free algorithm for stemming for an open class of languages. Since it does not rely on existing large data collections or on linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision whether two given words are variants of the same stem and requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
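
The pairwise formulation in the abstract (are two words concatenative variants of the same stem, given that salient affixes are frequent?) can be illustrated with a toy sketch. This is not the chapter's algorithm: it handles suffixes only, uses raw type counts instead of the paper's statistical model, and the names `suffix_counts`, `same_stem`, `min_stem`, and `min_freq` are hypothetical choices made for this illustration.

```python
from collections import Counter

def suffix_counts(words, max_len=4):
    """Count, over distinct word types, each word-final substring of
    length 1..max_len (always leaving at least one character as stem)."""
    counts = Counter()
    for w in set(words):
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts

def split_on_common_prefix(a, b):
    """Return (shared prefix, remainder of a, remainder of b)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i], a[i:], b[i:]

def same_stem(a, b, counts, min_stem=3, min_freq=2):
    """Toy decision: a and b share a stem if they have a long enough
    common prefix and each leftover suffix is empty or frequent.
    min_stem and min_freq are arbitrary thresholds (assumptions)."""
    stem, suf_a, suf_b = split_on_common_prefix(a, b)
    if len(stem) < min_stem:
        return False
    return all(s == "" or counts[s] >= min_freq for s in (suf_a, suf_b))

# Tiny raw-text vocabulary, invented for the example.
words = ["walk", "walks", "walked", "talk", "talks", "talked", "wall"]
counts = suffix_counts(words)

print(same_stem("walks", "walked", counts))  # True: "-s" and "-ed" each occur twice
print(same_stem("walk", "wall", counts))     # False: "-l" occurs only once
print(same_stem("walk", "talk", counts))     # False: no common prefix
```

A frequency threshold alone is a crude stand-in for assumptions (2) and (3) of the abstract. Distinguishing systematic affix alternations from chance overlaps between random character sequences is the part the chapter itself addresses.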
