Zero-Shot Source Code Author Identification: A Lexicon and Layout Independent Approach

Pegah Hozhabrierdi, Dunai Fuentes Hitos, Chilukuri K. Mohan

Research output: Chapter in Book/Entry/Poem › Conference contribution

Abstract

We tackle the challenge of Zero-Shot identification of authors of source code, which can be used with no prior samples of authors outside of the training data. In our approach, a feedforward neural network is first trained on a multi-class classification task. Then, a substantial part of this network is duplicated and reused to compare code samples. We refer to this design as the Feedforward Duplicated Resolver (FDR) model. We propose new input features to train this model, called Variable-Independent Nested Bigrams, extracted from the Abstract Syntax Trees of code samples. These features provide robustness against lexical and layout obfuscation attacks frequently used in plagiarism attempts. This approach performs accurately even on code samples from unknown authors, on data obtained from Google Code Jam, an international coding competition platform. For example, for the task of predicting whether a pair of samples from 43 unknown authors was written by the same person, we obtain an AUC of 0.96 and 0.91 for non-obfuscated and obfuscated code, respectively.
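To illustrate why features taken from an Abstract Syntax Tree can be insensitive to identifier names and layout, here is a minimal sketch. It is not the authors' Variable-Independent Nested Bigrams, whose exact definition the abstract does not give. It assumes Python's standard ast module as a stand-in parser, and the helper name ast_type_bigrams is hypothetical. It only counts (parent node type, child node type) pairs.

```python
import ast
from collections import Counter


def ast_type_bigrams(source: str) -> Counter:
    """Count (parent type, child type) pairs over the AST of `source`.

    Hypothetical helper for illustration only. Identifier names are stored
    as string attributes rather than as child nodes, and whitespace is
    discarded by the parser, so neither affects the counts.
    """
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


# Same structure, different identifiers and spacing.
original = "def f(x):\n    return x + 1\n"
obfuscated = "def   total( value ):\n        return value+1\n"

# A structurally different program (loop instead of a single expression).
different = "def f(x):\n    for i in range(x):\n        x += i\n    return x\n"

same_features = ast_type_bigrams(original) == ast_type_bigrams(obfuscated)
changed_features = ast_type_bigrams(original) != ast_type_bigrams(different)
```

In this sketch, renaming identifiers or changing indentation leaves the feature counts unchanged, while a change in program structure alters them. A real feature extractor would need a parser for the language of the samples and the feature definition given in the paper itself.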

Original language: English (US)
Title of host publication: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728169262
State: Published - Jul 2020
Event: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Virtual, Glasgow, United Kingdom
Duration: Jul 19 2020 to Jul 24 2020

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2020 International Joint Conference on Neural Networks, IJCNN 2020
Country/Territory: United Kingdom
City: Virtual, Glasgow
Period: 7/19/20 to 7/24/20

Keywords

  • Author identification
  • Obfuscation
  • Source code stylometry
  • Zero-shot learning

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
