Text Categorization for Multiple Users Based on Semantic Features From a Machine-Readable Dictionary

Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu

Research output: Contribution to journalArticlepeer-review

20 Scopus citations

Abstract

The text categorization module described here provides a front-end filtering function for the larger DR-LINK text retrieval system [Liddy and Myaeing 1993]. The model evaluates a large incoming stream of documents to determine which documents are sufficiently similar to a profile at the broad subject level to warrant more refined representation and matching. To accomplish this task, each substantive word in a text is first categorized using a feature set based on the semantic Subject Field Codes 1994 assigned to individual word senses in a machine-readable dictionary. When tested on 50 user profiles and 550 megabytes of documents, results indicate that the feature set that is the basis of the text categorization module and the algorithm that establishes the boundary of categories of potentially relevant documents accomplish their tasks with a high level of performance. This means that the category of potentially relevant documents for most profiles would contain at least 80% of all documents later determined to be relevant to the profile. The number of documents in this set would be uniquely determined by the system's category-boundary predictor, and this set is likely to contain less than 5% of the incoming stream of documents.

Original languageEnglish (US)
Pages (from-to)278-295
Number of pages18
JournalACM Transactions on Information Systems (TOIS)
Volume12
Issue number3
DOIs
StatePublished - Jan 7 1994

Keywords

  • semantic vectors
  • subject field coding

ASJC Scopus subject areas

  • Information Systems
  • General Business, Management and Accounting
  • Computer Science Applications

Fingerprint

Dive into the research topics of 'Text Categorization for Multiple Users Based on Semantic Features From a Machine-Readable Dictionary'. Together they form a unique fingerprint.

Cite this