@inproceedings{Sarawagi2008,
abstract = {The automatic extraction of information from unstructured sources
has opened up new avenues for querying, organizing, and analyzing
data by drawing upon the clean semantics of structured databases
and the abundance of unstructured data. The field of information
extraction has its genesis in the natural language processing community
where the primary impetus came from competitions centered around
the recognition of named entities such as people and organization names
from news articles. As society became more data-oriented, with easy
online access to both structured and unstructured data, new applications
of structure extraction emerged. Now, there is interest in converting
our personal desktops to structured databases, the knowledge in scientific
publications to structured records, and harnessing the Internet for
structured fact-finding queries. Consequently, there are many different
communities of researchers bringing in techniques from machine learning,
databases, information retrieval, and computational linguistics for
various aspects of the information extraction problem.
This review surveys more than two decades of information extraction
research from these diverse communities. We create a taxonomy
of the field along various dimensions derived from the nature of
the extraction task, the techniques used for extraction, the variety
of input resources exploited, and the type of output produced. We
elaborate on rule-based and statistical methods for entity and relationship
extraction. In each case we highlight the different kinds of models
for capturing the diversity of clues driving the recognition process
and the algorithms for training and efficiently deploying the models.
We survey techniques for optimizing the various steps in an information
extraction pipeline, adapting to dynamic data, integrating with existing
entities, and handling uncertainty in the extraction process.},
added-at = {2013-08-04T14:35:14.000+0200},
author = {Sarawagi, Sunita},
biburl = {https://www.bibsonomy.org/bibtex/2972cafa887799cd60b83f58f63149f2f/francesco.k},
booktitle = {Foundations and Trends{\textregistered} in Databases},
doi = {10.1561/1900000003},
file = {:ieSurvey.pdf:PDF},
interhash = {afc767b7f9a7fef896a0672c1e8ff241},
intrahash = {972cafa887799cd60b83f58f63149f2f},
keywords = {imported},
number = 3,
pages = {261--377},
timestamp = {2013-08-04T14:35:15.000+0200},
title = {Information Extraction},
volume = 1,
year = 2008
}