Python source code de-anonymization using nested bigrams

Pegah Hozhabrierdi, Dunai Fuentes Hitos, Chilukuri K Mohan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

An important issue in cybersecurity is the insertion or modification of code by individuals other than the original authors of the code. This motivates research on authorship attribution of unknown source code. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. Such features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. A new ranking system for reducing the number of features is also proposed, and experiments show that this approach can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.
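To make the idea of features drawn from the connections between abstract syntax tree nodes more concrete, here is a minimal sketch using Python's standard `ast` module. The abstract does not define nested bigrams precisely, so this sketch assumes the simplest reading: a count of (parent node type, child node type) pairs. The function names `ast_bigrams` and `to_frequencies` are hypothetical and are not taken from the paper.

```python
import ast
from collections import Counter


def ast_bigrams(source: str) -> Counter:
    """Count (parent type, child type) pairs over the AST of `source`.

    Assumption: a "bigram" here is one parent-child edge of the tree,
    labeled with the two node type names. The paper's exact feature
    definition may differ.
    """
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


def to_frequencies(counts: Counter) -> dict:
    """Normalize raw counts to relative frequencies (empty input gives {})."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


sample = "def add(a, b):\n    return a + b\n"
features = ast_bigrams(sample)
frequencies = to_frequencies(features)
```

Each distinct pair would become one column of a feature vector, so that source files by different authors can be compared by how often each structural pattern occurs. Normalizing to frequencies is an illustrative choice here, meant to keep files of different lengths comparable.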

Original language: English (US)
Title of host publication: Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Editors: Jeffrey Yu, Zhenhui Li, Hanghang Tong, Feida Zhu
Publisher: IEEE Computer Society
Pages: 23-28
Number of pages: 6
ISBN (Electronic): 9781538692882
DOI: https://doi.org/10.1109/ICDMW.2018.00011
State: Published - Feb 7 2019
Event: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018 - Singapore, Singapore
Duration: Nov 17 2018 → Nov 20 2018

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 2018-November
ISSN (Print): 2375-9232
ISSN (Electronic): 2375-9259

Conference

Conference: 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
Country: Singapore
City: Singapore
Period: 11/17/18 → 11/20/18

Fingerprint

Feedforward neural networks
Feature extraction
Classifiers
Experiments

Keywords

  • abstract syntax tree
  • feature extraction
  • feature ranking
  • source code de-anonymization
  • source code stylometry

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Hozhabrierdi, P., Fuentes Hitos, D., & Mohan, C. K. (2019). Python source code de-anonymization using nested bigrams. In J. Yu, Z. Li, H. Tong, & F. Zhu (Eds.), Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018 (pp. 23-28). [8637444] (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 2018-November). IEEE Computer Society. https://doi.org/10.1109/ICDMW.2018.00011

@inproceedings{704111d1d1f94cd681c91f343bf0e1d3,
title = "Python source code de-anonymization using nested bigrams",
abstract = "An important issue in cybersecurity is the insertion or modification of code by individuals other than the original authors of the code. This motivates research on authorship attribution of unknown source code. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. Such features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. A new ranking system for reducing the number of features is also proposed, and experiments show that this approach can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.",
keywords = "abstract syntax tree, feature extraction, feature ranking, source code de-anonymization, source code stylometry",
author = "Hozhabrierdi, Pegah and {Fuentes Hitos}, Dunai and Mohan, {Chilukuri K}",
year = "2019",
month = "2",
day = "7",
doi = "10.1109/ICDMW.2018.00011",
language = "English (US)",
series = "IEEE International Conference on Data Mining Workshops, ICDMW",
publisher = "IEEE Computer Society",
pages = "23--28",
editor = "Jeffrey Yu and Zhenhui Li and Hanghang Tong and Feida Zhu",
booktitle = "Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018",
address = "United States",

}

TY - GEN
T1 - Python source code de-anonymization using nested bigrams
AU - Hozhabrierdi, Pegah
AU - Fuentes Hitos, Dunai
AU - Mohan, Chilukuri K
PY - 2019/2/7
Y1 - 2019/2/7
AB - An important issue in cybersecurity is the insertion or modification of code by individuals other than the original authors of the code. This motivates research on authorship attribution of unknown source code. We address the deficiencies of previously used feature extraction methods and propose a novel approach: Nested Bigrams. Such features are easy to extract and carry substantial information about the interconnections between the nodes of the abstract syntax tree. We also show that, for a large number of authors, a Strongly Regularized Feed-forward Neural Network outperforms the Random Forest Classifier used in many code stylometric studies. A new ranking system for reducing the number of features is also proposed, and experiments show that this approach can reduce the feature set to 98 nested bigrams while maintaining a classification accuracy above 90 percent.
KW - abstract syntax tree
KW - feature extraction
KW - feature ranking
KW - source code de-anonymization
KW - source code stylometry
UR - http://www.scopus.com/inward/record.url?scp=85062869904&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062869904&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2018.00011
DO - 10.1109/ICDMW.2018.00011
M3 - Conference contribution
AN - SCOPUS:85062869904
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 23
EP - 28
BT - Proceedings - 18th IEEE International Conference on Data Mining Workshops, ICDMW 2018
A2 - Yu, Jeffrey
A2 - Li, Zhenhui
A2 - Tong, Hanghang
A2 - Zhu, Feida
PB - IEEE Computer Society
ER -