An Improved Corpus of Disease Mentions in PubMed Citations

R. I. Doğan and Z. Lu.
Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pages 91--99. Stroudsburg, PA, USA, Association for Computational Linguistics, (2012)

Abstract

The latest discoveries on diseases and their diagnosis/treatment are mostly disseminated in the form of scientific publications. However, with the rapid growth of the biomedical literature and a high level of variation and ambiguity in disease names, the task of retrieving disease-related articles becomes increasingly challenging using the traditional keyword-based approach. An important first step for any disease-related information extraction task in the biomedical literature is the disease mention recognition task. However, despite the strong interest, there has not been enough work done on disease name identification, perhaps because of the difficulty in obtaining adequate corpora. Towards this aim, we created a large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus. Our corpus contains rich annotations, was developed by a team of 12 annotators (two people per annotation) and covers all sentences in a PubMed abstract. Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories. When used as the gold standard data for a state-of-the-art machine-learning approach, significantly higher performance can be found on our corpus than the previous one. Such characteristics make this disease name corpus a valuable resource for mining disease-related information from biomedical text. The NCBI corpus is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html.

BibTeX key: Dogan:2012:ICD:2391123.2391135
entry type: inproceedings
address: Stroudsburg, PA, USA
booktitle: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
year: 2012
pages: 91--99
publisher: Association for Computational Linguistics
series: BioNLP '12
acmid: 2391135
location: Montreal, Canada
numpages: 9
url: http://dl.acm.org/citation.cfm?id=2391123.2391135

Cite this publication

@inproceedings{Dogan:2012:ICD:2391123.2391135,
  abstract = {The latest discoveries on diseases and their diagnosis/treatment are mostly disseminated in the form of scientific publications. However, with the rapid growth of the biomedical literature and a high level of variation and ambiguity in disease names, the task of retrieving disease-related articles becomes increasingly challenging using the traditional keyword-based approach. An important first step for any disease-related information extraction task in the biomedical literature is the disease mention recognition task. However, despite the strong interest, there has not been enough work done on disease name identification, perhaps because of the difficulty in obtaining adequate corpora. Towards this aim, we created a large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus. Our corpus contains rich annotations, was developed by a team of 12 annotators (two people per annotation) and covers all sentences in a PubMed abstract. Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories. When used as the gold standard data for a state-of-the-art machine-learning approach, significantly higher performance can be found on our corpus than the previous one. Such characteristics make this disease name corpus a valuable resource for mining disease-related information from biomedical text. The NCBI corpus is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html.},
  acmid = {2391135},
  added-at = {2016-06-29T10:37:36.000+0200},
  address = {Stroudsburg, PA, USA},
  author = {Do\u{g}an, Rezarta Islamaj and Lu, Zhiyong},
  biburl = {https://www.bibsonomy.org/bibtex/2174bd90ec11a8f551db04de3678ecad7/isaric1},
  booktitle = {Proceedings of the 2012 Workshop on Biomedical Natural Language Processing},
  interhash = {fb1201e598485b2eeb6aea63eaa5b560},
  intrahash = {174bd90ec11a8f551db04de3678ecad7},
  keywords = {text_analysis},
  location = {Montreal, Canada},
  numpages = {9},
  pages = {91--99},
  publisher = {Association for Computational Linguistics},
  series = {BioNLP '12},
  timestamp = {2016-06-29T10:37:36.000+0200},
  title = {An Improved Corpus of Disease Mentions in PubMed Citations},
  url = {http://dl.acm.org/citation.cfm?id=2391123.2391135},
  year = 2012
}