Python source code de-anonymization using nested bigrams

Pegah Hozhabrierdi, Dunai Fuentes Hitos, Chilukuri K. Mohan

Research output: Chapter in Book/Entry/Poem › Conference contribution

4 Scopus citations

Abstract

An important issue in cybersecurity is the insertion or modification of code by individuals other than the original authors of the code. This motivates research on authorship attribution of unknown source code. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. Such features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. A new ranking system for reducing the number of features is also proposed, and experiments show that this approach can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.
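The abstract does not spell out how a nested bigram is constructed, so the following is only a minimal sketch of the general idea of counting pairs of connected abstract syntax tree nodes. It assumes, purely for illustration, that a bigram is a (parent node type, child node type) pair; the function name `parent_child_bigrams` and the sample snippet are hypothetical and are not taken from the paper.

```python
import ast
from collections import Counter


def parent_child_bigrams(source: str) -> Counter:
    """Count (parent type, child type) pairs in the AST of `source`.

    Assumption: this plain parent-child pairing is a stand-in for the
    paper's nested bigrams, whose exact definition is not given here.
    """
    tree = ast.parse(source)
    counts = Counter()
    # ast.walk visits every node; iter_child_nodes yields its direct children.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


# Hypothetical sample program used only to exercise the function.
sample = "def add(a, b):\n    return a + b\n"
bigrams = parent_child_bigrams(sample)

for pair, count in sorted(bigrams.items()):
    print(pair, count)
```

Counts like these could be collected per source file and used as a feature vector for a classifier; how the paper ranks and reduces such features to 98 is not reproduced in this sketch.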

Original language: English (US)
Title of host publication: Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Editors: Hanghang Tong, Zhenhui Li, Feida Zhu, Jeffrey Yu
Publisher: IEEE Computer Society
Pages: 23-28
Number of pages: 6
ISBN (Electronic): 9781538692882
State: Published - Jul 2 2018
Event: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018 - Singapore, Singapore
Duration: Nov 17 2018 → Nov 20 2018

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 2018-November
ISSN (Print): 2375-9232
ISSN (Electronic): 2375-9259

Conference

Conference: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Country/Territory: Singapore
City: Singapore
Period: 11/17/18 → 11/20/18

Keywords

  • abstract syntax tree
  • feature extraction
  • feature ranking
  • source code de-anonymization
  • source code stylometry

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
