Article

A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing

T. Tanabe, M. Takahashi, and K. Shudo.
Computer Speech & Language (2013)

Abstract

Since Sag et al. (2002) highlighted a key problem that had been underappreciated in the past in natural language processing (NLP), namely idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, clichés, quasi-clichés, institutionalized phrases, proverbs, old sayings, etc., and how to deal with them, many attempts have been made to extract these expressions from corpora and construct a lexicon of them. However, no extensive, reliable solution has yet been realized. This paper presents an overview of a comprehensive lexicon of Japanese multiword expressions (Japanese MWE Lexicon: JMWEL), which has been compiled in order to realize linguistically precise and wide-coverage natural Japanese processing systems. The JMWEL is characterized by significant notational, syntactic, and semantic diversity as well as a detailed description of the syntactic functions, structures, and flexibilities of MWEs. The lexicon contains about 111,000 header entries written in kana (phonetic characters) and their almost 820,000 variants written in kana and kanji (ideographic characters). The paper demonstrates the JMWEL's validity, supported mainly by comparing the lexicon with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08 generated by Google Inc. (Kudo and Kazawa, 2009). The present work is an attempt to provide a tentative answer for Japanese, from outside statistical empiricism, to the question posed by Church (2011): "How many multiword expressions do people know?"

BibTeX key: tanabe_lexicon_2013
entry type: article
year: 2013
journal: Computer Speech & Language
url: http://www.sciencedirect.com/science/article/pii/S0885230813000600#

Cite this publication

@article{tanabe_lexicon_2013,
  abstract = {Since Sag et al. (2002) highlighted a key problem that had been underappreciated in the past in natural language processing (NLP), namely idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, clichés, quasi-clichés, institutionalized phrases, proverbs, old sayings, etc., and how to deal with them, many attempts have been made to extract these expressions from corpora and construct a lexicon of them. However, no extensive, reliable solution has yet been realized. This paper presents an overview of a comprehensive lexicon of Japanese multiword expressions (Japanese MWE Lexicon: JMWEL), which has been compiled in order to realize linguistically precise and wide-coverage natural Japanese processing systems. The JMWEL is characterized by significant notational, syntactic, and semantic diversity as well as a detailed description of the syntactic functions, structures, and flexibilities of MWEs. The lexicon contains about 111,000 header entries written in kana (phonetic characters) and their almost 820,000 variants written in kana and kanji (ideographic characters). The paper demonstrates the JMWEL's validity, supported mainly by comparing the lexicon with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08 generated by Google Inc. (Kudo and Kazawa, 2009). The present work is an attempt to provide a tentative answer for Japanese, from outside statistical empiricism, to the question posed by Church (2011): ``How many multiword expressions do people know?''},
  added-at = {2018-11-04T17:02:36.000+0100},
  author = {Tanabe, Toshifumi and Takahashi, Masahito and Shudo, Kosho},
  biburl = {https://www.bibsonomy.org/bibtex/27352f243b5790b805aae9124ab5a1514/lepsky},
  interhash = {c605a5efcce00801d8b7ea3844b57207},
  intrahash = {7352f243b5790b805aae9124ab5a1514},
  journal = {Computer Speech \& Language},
  keywords = {mehrwortbegriffe},
  timestamp = {2018-11-04T17:02:36.000+0100},
  title = {A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing},
  url = {http://www.sciencedirect.com/science/article/pii/S0885230813000600#},
  year = 2013
}
