The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA has been shown to be an important component of automating software engineering (SE) tasks such as bug triaging, fault localization, and test prioritization. CAA is also important in cybersecurity and software forensics for addressing copyright disputes and detecting plagiarism. Past techniques for CAA tend to leverage hand-crafted code-related features, which typically carry limitations that prevent proper authorship characterization and leave the techniques sensitive to adversarial attacks. Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of SE tasks and in authorship attribution for natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs and five smaller code LMs to the task of CAA on six diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth quantitative and qualitative analysis of our studied models' performance on CAA using established interpretability techniques. Our results illustrate important aspects of the behavior of LMs in understanding stylometric code patterns.