Abstract
While generative models such as Latent Dirichlet Allocation (LDA) have proven
fruitful in topic modeling, they often require detailed assumptions and careful
specification of hyperparameters. Such model complexity issues only compound
when trying to generalize generative models to incorporate human input. We
introduce Correlation Explanation (CorEx), an alternative approach to topic
modeling that does not assume an underlying generative model, and instead
learns maximally informative topics through an information-theoretic framework.
This framework naturally generalizes to hierarchical and semi-supervised
extensions with no additional modeling assumptions. In particular, word-level
domain knowledge can be flexibly incorporated within CorEx through anchor
words, allowing topic separability and representation to be promoted with
minimal human intervention. Across a variety of datasets, metrics, and
experiments, we demonstrate that CorEx produces topics that are comparable in
quality to those produced by unsupervised and semi-supervised variants of LDA.
Users
Please
log in to take part in the discussion (add own reviews or comments).