TY - GEN
T1 - Zero-Shot Source Code Author Identification
T2 - 2020 International Joint Conference on Neural Networks, IJCNN 2020
AU - Hozhabrierdi, Pegah
AU - Hitos, Dunai Fuentes
AU - Mohan, Chilukuri K.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
N2 - We tackle the challenge of zero-shot identification of source code authors: identifying authors outside the training data, for whom no prior samples are available. In our approach, a feedforward neural network is first trained on a multi-class classification task. Then, a substantial part of this network is duplicated and reused to compare code samples. We refer to this design as the Feedforward Duplicated Resolver (FDR) model. We propose new input features to train this model, called Variable-Independent Nested Bigrams, extracted from the Abstract Syntax Trees of code samples. These features provide robustness against the lexical and layout obfuscation attacks frequently used in plagiarism attempts. Evaluated on data from Google Code Jam, an international coding competition platform, the approach performs accurately even on code samples from unknown authors. For example, for the task of predicting whether a pair of samples from 43 unknown authors was written by the same person, we obtain an AUC of 0.96 for non-obfuscated code and 0.91 for obfuscated code.
KW - Author identification
KW - Obfuscation
KW - Source code stylometry
KW - Zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85093843824&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093843824&partnerID=8YFLogxK
U2 - 10.1109/IJCNN48605.2020.9207647
DO - 10.1109/IJCNN48605.2020.9207647
M3 - Conference contribution
AN - SCOPUS:85093843824
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 July 2020 through 24 July 2020
ER -