As cyberattacks increasingly target the software supply chain, accurately attributing code authorship in binary files has become critical to cybersecurity. We propose OCEAN, a contrastive learning-based system for function-level authorship attribution. OCEAN is the first framework to explore code authorship attribution on compiled binaries in an extreme open-world scenario, in which two code samples from unknown authors are compared to determine whether they were written by the same author. To evaluate OCEAN, we introduce two new realistic datasets: CONAN, which improves the performance of authorship attribution systems in real-world use cases, and SNOOPY, which makes the evaluation of such systems more robust. We train our model on CONAN and evaluate it on SNOOPY, a fully unseen dataset, obtaining an AUROC of 0.86 even under high compiler optimization levels. We further show that CONAN improves performance by 7% compared to the previously used Google Code Jam dataset. OCEAN also outperforms previous methods in their own settings, achieving a 10% improvement over the state-of-the-art SCS-Gan in scenarios that analyze source code. Finally, OCEAN can detect code injected by an unknown author into a software update, underscoring its value for securing software supply chains.
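The open-world setting described above amounts to pairwise verification: every pair of function samples is scored for similarity, and AUROC measures how well those scores separate same-author pairs from different-author pairs. The following is a minimal sketch of that evaluation loop, not OCEAN's implementation. It assumes each function has already been mapped to an embedding vector by some trained encoder, and it uses cosine similarity as the pair score, which is an assumption since the abstract does not name the similarity function. The names `cosine`, `pairwise_eval`, and `auroc`, as well as the toy vectors, are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_eval(embeddings, authors):
    """Score every unordered pair of samples.

    Returns similarity scores and labels (1 = same author, 0 = different).
    Author identities are used only to build ground-truth labels, never
    to compute the score itself.
    """
    scores, labels = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        scores.append(cosine(embeddings[i], embeddings[j]))
        labels.append(1 if authors[i] == authors[j] else 0)
    return scores, labels


def auroc(scores, labels):
    """AUROC as the probability that a random same-author pair scores
    higher than a random different-author pair (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: made-up 2-D embeddings for four functions by two authors.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_authors = ["a", "a", "b", "b"]

toy_scores, toy_labels = pairwise_eval(toy_embeddings, toy_authors)
toy_auroc = auroc(toy_scores, toy_labels)
print(f"AUROC on toy pairs: {toy_auroc:.2f}")
```

Because the decision is made per pair rather than per known author label, the same loop applies unchanged to authors never seen during training, which is what makes the setting open-world.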