Large language models (LLMs) have demonstrated impressive capabilities across a wide range of coding tasks, including summarization, translation, completion, and code generation. Despite these advances, detecting code vulnerabilities remains a challenging problem for LLMs. In-context learning (ICL) has emerged as an effective mechanism for improving model performance by providing a small number of labeled examples within the prompt. Prior work has shown, however, that the effectiveness of ICL depends critically on how these few-shot examples are selected. In this paper, we study two intuitive criteria for selecting few-shot examples for ICL in the context of code vulnerability detection. The first criterion leverages model behavior by prioritizing samples on which the LLM consistently makes mistakes, motivated by the intuition that such samples can expose and correct systematic model weaknesses. The second criterion selects examples based on semantic similarity to the query program, using k-nearest-neighbor retrieval to identify relevant contexts. We conduct extensive evaluations using open-source LLMs and datasets spanning multiple programming languages. Our results show that for Python and JavaScript, careful selection of few-shot examples can lead to measurable performance improvements in vulnerability detection. In contrast, for C and C++ programs, few-shot example selection has limited impact, suggesting that more powerful but also more expensive approaches, such as re-training or fine-tuning, may be required to substantially improve model performance.
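The second selection criterion, retrieving the k labeled examples most similar to the query program, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the names `embed`, `cosine`, and `knn_select` are hypothetical, and a simple token-frequency vector stands in for a real semantic code embedding.

```python
from collections import Counter
import math

def embed(code: str) -> Counter:
    # Hypothetical stand-in for a semantic embedding:
    # a bag-of-tokens frequency vector over whitespace-split tokens.
    return Counter(code.split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_select(query: str, pool: list, k: int = 2) -> list:
    # Rank the labeled pool by similarity to the query program
    # and return the top-k (program, label) pairs as few-shot examples.
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]

# Toy labeled pool of (program snippet, vulnerability label) pairs.
pool = [
    ("query = 'SELECT * FROM users WHERE id=' + user_id", "vulnerable"),
    ("cursor.execute('SELECT * FROM users WHERE id=%s', (user_id,))", "safe"),
    ("print('hello world')", "safe"),
]
query = "sql = 'SELECT * FROM orders WHERE id=' + order_id"
shots = knn_select(query, pool, k=2)
```

The retrieved `shots` would then be formatted as labeled demonstrations in the ICL prompt ahead of the query program; here the string-concatenated SQL example ranks first because it shares the most tokens with the query.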