Using target-language information to train part-of-speech taggers for machine translation

Machine Translation, 22 (1): 29--66 (2008)

Abstract

Although corpus-based approaches to machine translation (MT) are growing in interest, they are not applicable when the translation involves less-resourced language pairs for which there are no parallel corpora available; in those cases, the rule-based approach is the only applicable solution. Most rule-based MT systems make use of part-of-speech (PoS) taggers to solve the PoS ambiguities in the source-language texts to translate; those MT systems require accurate PoS taggers to produce reliable translations in the target language (TL). The standard statistical approach to PoS ambiguity resolution (or tagging) uses hidden Markov models (HMM) trained in a supervised way from hand-tagged corpora, an expensive resource not always available, or in an unsupervised way through the Baum-Welch expectation-maximization algorithm; both methods use information only from the language being tagged. However, when tagging is considered as an intermediate task for the translation procedure, that is, when the PoS tagger is to be embedded as a module within an MT system, information from the TL can be (unsupervisedly) used in the training phase to increase the translation quality of the whole MT system. This paper presents a method to train HMM-based PoS taggers to be used in MT; the new method uses not only information from the source language (SL), as general-purpose methods do, but also information from the TL and from the remaining modules of the MT system in which the PoS tagger is to be embedded. We find that the translation quality of the MT system embedding a PoS tagger trained in an unsupervised manner through this new method is clearly better than that of the same MT system embedding a PoS tagger trained through the Baum-Welch algorithm, and comparable to that obtained by embedding a PoS tagger trained in a supervised way from hand-tagged corpora.
