monolingual, parallel and annotated corpora. The fourteen monolingual corpora, which include both written and
(for some languages) spoken data, cover the following South Asian languages: Assamese,
Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi,
Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately
92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali,
Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of
text in English and its accompanying
translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component
includes the Urdu monolingual and parallel corpora annotated for parts of speech,
together with twenty written Hindi corpus files annotated to show the nature
of demonstrative use. The corpus is marked up using CES-compliant SGML, and
encoded using Unicode.