Abstract
When applying text learning algorithms to
complex tasks, it is tedious and expensive to
hand-label the large amounts of training data
necessary for good performance. This paper
presents bootstrapping as an alternative
approach to learning from large sets of labeled
data. Instead of a large quantity of labeled
data, this paper advocates using a small
amount of seed information and a large collection
of easily obtained unlabeled data. Bootstrapping
initializes a learner with the seed information;
it then iterates, applying the learner
to calculate labels for the unlabeled data, and
incorporating some of these labels into the
training input for the learner. Two case studies
of this approach are presented. Bootstrapping
for information extraction provides 76% precision
for a 250-word dictionary for extracting
locations from web pages, when starting with
just a few seed locations. Bootstrapping a text
classifier from a few keywords per class and
a class hierarchy provides accuracy of 66%, a
level close to human agreement, when placing
computer science research papers into a topic
hierarchy. The success of these two examples
argues for the strength of the general bootstrapping
approach for text learning tasks.
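
The bootstrapping loop described in the abstract (initialize from seeds, label the unlabeled data, fold some of those labels back into the training input, repeat) can be illustrated with a minimal sketch. This is not the method used in either case study: the word-count learner, the confidence threshold, and the names `train`, `predict`, and `bootstrap` are all assumptions chosen only to keep the example short and self-contained.

```python
from collections import Counter


def train(labeled):
    """Build per-class word counts from (tokens, label) pairs (toy learner, an assumption)."""
    counts = {}
    for tokens, label in labeled:
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def predict(model, tokens):
    """Return (label, confidence); confidence is the winning class's share of matched counts."""
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known word: the learner abstains
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def bootstrap(seeds, unlabeled, rounds=5, threshold=0.8):
    """seeds: dict mapping label -> seed keywords; unlabeled: list of token lists.

    The threshold and round limit are hypothetical parameters, not values from the paper.
    """
    # Initialize the learner with the seed information only.
    labeled = [(list(words), label) for label, words in seeds.items()]
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        # Apply the learner to calculate labels for the unlabeled data.
        for tokens in pool:
            label, conf = predict(model, tokens)
            if label is not None and conf >= threshold:
                confident.append((tokens, label))
            else:
                rest.append(tokens)
        if not confident:
            break  # nothing new to learn from
        # Incorporate only the confident labels into the training input.
        labeled.extend(confident)
        pool = rest
        model = train(labeled)
    return model, labeled, pool


seeds = {"location": ["city", "country"], "person": ["mr", "dr"]}
unlabeled = [["city", "paris"], ["dr", "smith"],
             ["paris", "france"], ["smith", "said"]]
model, labeled, pool = bootstrap(seeds, unlabeled)
```

In this toy run, the documents that share no word with the seeds are labeled only in the second round, after "paris" and "smith" have been absorbed from confidently labeled documents. That is the effect the approach relies on: each round extends what the learner can recognize.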