Source Code Authorship Attribution (SCAA) supports software classification by providing insight into the origin and behavior of software. Accurately identifying the author or group behind a piece of code helps experts understand developers' motivations and techniques. In cybersecurity, attribution helps trace malicious software to its source, identify code patterns that may indicate specific threat actors or groups, and ultimately improve threat intelligence and mitigation strategies. This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA that covers lexical, semantic, syntactic, and N-gram features. We study author identification in C++ using 24,000 source code samples from 3,000 authors. Our methodology combines Random Forest, Gradient Boosting, and XGBoost models and uses SHAP for interpretability. The study demonstrates that ensemble models can effectively discern individual coding styles, offering insight into the attributes that distinguish one author's code from another's. This approach helps interpret complex patterns in authorship attribution, particularly for malware classification.
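To make the idea of lexical and N-gram stylometric features concrete, the following is a minimal sketch in Python. It is not AuthAttLyzer-V2's implementation: the function names `lexical_features` and `char_ngrams`, the keyword subset, and the specific features chosen are all assumptions made for illustration, since the abstract does not list the extractor's actual feature set.

```python
import re
from collections import Counter

# Hypothetical keyword subset for illustration; the actual feature list
# used by AuthAttLyzer-V2 is not specified in the abstract.
CPP_KEYWORDS = ("for", "while", "if", "else", "return", "class", "template")


def lexical_features(source: str) -> dict:
    """Compute a few simple layout and lexical statistics for C++ source text."""
    lines = source.splitlines()
    n_lines = max(len(lines), 1)  # avoid division by zero on empty input
    # Identifier-like tokens (keywords and names); numeric literals are skipped.
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    counts = Counter(tokens)
    n_tokens = max(len(tokens), 1)

    features = {
        "num_lines": len(lines),
        "avg_line_length": sum(len(line) for line in lines) / n_lines,
        # Share of lines indented with a tab, a common style signal.
        "tab_indent_ratio": sum(line.startswith("\t") for line in lines) / n_lines,
        # Rough count of comment openers; does not handle string literals.
        "comment_count": source.count("//") + source.count("/*"),
    }
    # Relative frequency of each keyword among identifier-like tokens.
    for kw in CPP_KEYWORDS:
        features[f"kw_{kw}"] = counts[kw] / n_tokens
    return features


def char_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the source text."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))
```

Feature dictionaries like these could be converted to fixed-length vectors and passed to a tree ensemble such as the Random Forest, Gradient Boosting, or XGBoost models named above. Semantic and syntactic features would require parsing the code and are outside the scope of this sketch.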