Abstract
Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often fail to infer the most suitable model parameters accurately. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility, high accuracy, and high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
Original language | English (US) |
---|---|
Article number | 011007 |
Journal | Physical Review X |
Volume | 5 |
Issue number | 1 |
State | Published - 2015 |
Externally published | Yes |
Keywords
- Interdisciplinary Physics
ASJC Scopus subject areas
- General Physics and Astronomy