High-reproducibility and high-accuracy method for automated topic classification

Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, Luís A. Amaral

Research output: Contribution to journal › Article › peer-review

63 Scopus citations

Abstract

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.

Original language: English (US)
Article number: 011007
Journal: Physical Review X
Volume: 5
Issue number: 1
DOIs
State: Published - 2015
Externally published: Yes

Keywords

  • Interdisciplinary Physics

ASJC Scopus subject areas

  • General Physics and Astronomy
