A cross-collection mixture model for comparative text mining

Cheng Xiang Zhai, Atulya Velivelli, Bei Yu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

186 Scopus citations

Abstract

In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and withincollection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.

Original languageEnglish (US)
Title of host publicationKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages743-748
Number of pages6
ISBN (Print)1581138881, 9781581138887
DOIs
StatePublished - 2004
EventKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Seattle, WA, United States
Duration: Aug 22 2004Aug 25 2004

Publication series

NameKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

OtherKDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
CountryUnited States
CitySeattle, WA
Period8/22/048/25/04

Keywords

  • Clustering
  • Comparative text mining
  • Mixture models

ASJC Scopus subject areas

  • Engineering(all)

Fingerprint Dive into the research topics of 'A cross-collection mixture model for comparative text mining'. Together they form a unique fingerprint.

Cite this