This research, part of a larger, multi-faceted risk assessment project for the Intelligence Community (IC), combines Natural Language Processing (NLP) and machine learning techniques to detect potentially malicious shifts in the semantic content of information accessed or produced by insiders within an organization. Our hypothesis is that fewer, more discriminative linguistic features can outperform the traditional bag-of-words (BOW) representation in classification tasks. Experiments using the standard Support Vector Machine (SVM) algorithm and the LibSVM algorithm compared the BOW representation with two NLP-based representations. Classification on the NLP-based document representation vectors achieved greater precision and recall while using forty-nine times fewer features than the BOW representation. The NLP-based representations improved classification performance by producing a lower-dimensional but more linearly separable feature space that modeled the problem domain more accurately. These results demonstrate that representing documents with sophisticated NLP-extracted features improves both the effectiveness and the efficiency of text classification with the SVM and LibSVM algorithms.
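The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the `CONCEPTS` lexicon (a hypothetical stand-in for the NLP-extracted features, which the abstract does not specify), and the small subgradient-descent linear SVM, which stands in for the SVM and LibSVM implementations used in the experiments.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bow_vectors(docs):
    """Bag-of-words: one dimension per distinct token in the corpus."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors

# Hypothetical concept lexicon (an assumption, not the paper's feature set).
CONCEPTS = {
    "finance": {"budget", "audit"},
    "weapons": {"missile", "warhead"},
}

def concept_vectors(docs):
    """Compact representation: one dimension per concept category."""
    names = sorted(CONCEPTS)
    vectors = []
    for d in docs:
        toks = tokenize(d)
        vectors.append([float(sum(t in CONCEPTS[n] for t in toks)) for n in names])
    return names, vectors

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Linear SVM (no bias term) trained by subgradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            for j in range(len(w)):
                # The hinge term contributes only when the margin is violated.
                hinge = yi * xi[j] if margin < 1 else 0.0
                w[j] -= lr * (lam * w[j] - hinge)
    return w

def predict(w, X):
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1 for x in X]

def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented toy corpus: +1 = off-topic for this insider, -1 = on-topic.
docs = [
    "missile test report filed",
    "warhead missile inventory update",
    "budget audit report filed",
    "annual budget audit update",
]
labels = [1, 1, -1, -1]

for name, (dims, X) in {"BOW": bow_vectors(docs),
                        "concepts": concept_vectors(docs)}.items():
    w = train_linear_svm(X, labels)
    p, r = precision_recall(labels, predict(w, X))
    print(f"{name}: {len(dims)} features, precision={p:.2f}, recall={r:.2f}")
```

The sketch evaluates on its own training data, so the scores say nothing about generalization; it only shows how a representation with far fewer dimensions can still separate the classes linearly.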
Published: 2005
Event: 2005 AAAI Spring Symposium, Stanford, CA, United States
Duration: Mar 21 2005 → Mar 23 2005