Deep Learning-Based Out-of-distribution Source Code Data Identification: How Far We Have Gone?

Software vulnerabilities (SVs) have become a common and serious concern for safety-critical security systems, which has driven significant progress in AI-based methods for software vulnerability detection (SVD). Although AI-based methods achieve promising performance in SVD and in other domains (e.g., computer vision), they are well known to fail to predict the ground-truth label of input data lying far from the training data distribution; such inputs are referred to as out-of-distribution (OOD) data, in contrast to in-distribution (ID) data. This drawback leads to serious issues because the models fail to indicate when they are likely mistaken. To address this problem, OOD detectors, which determine whether an input is ID or OOD, are applied before the input data is fed to the downstream AI-based modules. While OOD detection has been widely studied for computer vision and medical diagnosis applications, automated AI-based techniques for OOD source code data detection have not yet been well explored. To this end, we propose an innovative deep learning-based approach to the OOD source code data identification problem. Our method is derived from an information-theoretic perspective and uses cluster-contrastive learning to learn and leverage source code characteristics, enhancing data representation learning for this problem. Rigorous and comprehensive experiments on real-world source code datasets show that our approach outperforms state-of-the-art baselines by a wide margin. On average, our method improves on the baselines by around 15.27%, 7.39%, and 4.93% on the FPR, AUROC, and AUPR measures, respectively.
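To make the three reported measures concrete, the following minimal sketch shows one common way to compute them from per-sample OOD scores. It is an illustration only, not the evaluation code used in this work. It assumes that a higher score means "more likely OOD", that OOD is treated as the positive class, and that FPR is measured at a 95% true positive rate; none of these conventions is stated in the abstract. The function names and toy scores are hypothetical.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random ID sample."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of ID samples flagged as OOD when at least `tpr` of OOD samples are flagged."""
    ranked = sorted(ood_scores, reverse=True)
    k = max(1, math.ceil(tpr * len(ranked)))  # OOD samples that must be caught
    threshold = ranked[k - 1]
    return sum(s >= threshold for s in id_scores) / len(id_scores)


def aupr(id_scores, ood_scores):
    """Average precision with OOD as the positive class (ties broken by sort order)."""
    labeled = [(s, 1) for s in ood_scores] + [(s, 0) for s in id_scores]
    labeled.sort(key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_ood) in enumerate(labeled, start=1):
        if is_ood:
            hits += 1
            total += hits / rank  # precision at each recalled OOD sample
    return total / len(ood_scores)


# Toy scores (assumed values, for illustration only).
id_scores = [0.1, 0.2, 0.3, 0.4]
ood_scores = [0.35, 0.6, 0.7, 0.9]
print(auroc(id_scores, ood_scores))       # 0.9375
print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
print(aupr(id_scores, ood_scores))
```

Under these assumptions, a lower FPR and a higher AUROC or AUPR indicate a better detector, so an improvement on FPR corresponds to a reduction.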
