Improved document representation for classification tasks for the intelligence community

Ozgur Yilmazel, Svetlana Symonenko, Niranjan Balasubramanian, Elizabeth D. Liddy

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations

Abstract

Research within a larger, multi-faceted risk assessment project for the Intelligence Community (IC) combines Natural Language Processing (NLP) and Machine Learning techniques to detect potentially malicious shifts in the semantic content of information either accessed or produced by insiders within an organization. Our hypothesis is that the use of fewer, more discriminative linguistic features can outperform the traditional bag-of-words (BOW) representation in classification tasks. Experiments using the standard Support Vector Machine algorithm and the LibSVM algorithm compared the BOW representation and two NLP representations. Classification results on NLP-based document representation vectors achieved greater precision and recall using forty-nine times fewer features than the BOW representation. The NLP-based representations improved classification performance by producing a lower-dimensional but more linearly separable feature space that modeled the problem domain more accurately. Results demonstrate that document representation using sophisticated NLP-extracted features improved text classification effectiveness and efficiency with the SVM and LibSVM algorithms.
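The contrast the abstract draws between a bag-of-words vector and a smaller, more selective feature vector can be illustrated with a minimal sketch. This is not the authors' feature extraction: the two toy documents are invented, and the hand-picked `nlp_features` list is a hypothetical stand-in for NLP-extracted features, used only to show how the dimensionality differs.

```python
from collections import Counter


def count_vector(text, features):
    """Count how often each feature term occurs in the whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in features]


# Toy documents, invented for illustration only.
docs = [
    "the analyst accessed the report on missile proliferation",
    "the analyst wrote a report on regional elections",
]

# BOW: every distinct token in the collection becomes a feature.
bow_vocab = sorted({tok for d in docs for tok in d.lower().split()})

# Hypothetical stand-in for NLP-extracted features: a short list of topical terms.
nlp_features = ["missile", "proliferation", "elections"]

bow_vectors = [count_vector(d, bow_vocab) for d in docs]
nlp_vectors = [count_vector(d, nlp_features) for d in docs]

print(len(bow_vocab), len(nlp_features))  # dimensionality of each representation
print(nlp_vectors)
```

Even on two short sentences the BOW vocabulary is several times larger than the selective feature list, and most of its dimensions (function words, shared terms) do nothing to separate the documents. Either kind of vector could then be passed to an SVM implementation for classification.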

Original language: English (US)
Pages: 76-82
Number of pages: 7
State: Published - 2005
Event: 2005 AAAI Spring Symposium - Stanford, CA, United States
Duration: Mar 21 2005 → Mar 23 2005

Other

Other: 2005 AAAI Spring Symposium
Country/Territory: United States
City: Stanford, CA
Period: 3/21/05 → 3/23/05

ASJC Scopus subject areas

  • General Engineering
