Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks rely on binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchy-aware, CWE-specific penalties. ALPHA distinguishes between over-generalisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find that LLMs substantially outperform SAST tools, although SAST demonstrates higher precision when detections do occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work that incorporates ALPHA penalties into supervised fine-tuning, which, pending empirical validation, could provide principled hierarchy-aware vulnerability detection.
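To make the penalty distinctions concrete, the following is a minimal illustrative sketch of how a hierarchy-aware CWE penalty could be assigned. The parent map, penalty values, and the names `CWE_PARENTS` and `alpha_penalty` are hypothetical examples for exposition only, not the actual ALPHA weights or the full CWE hierarchy.

```python
# Illustrative sketch only: hierarchy-aware CWE penalties that distinguish
# over-generalisation, over-specification, and lateral errors.
# The parent map and penalty magnitudes below are toy assumptions.

CWE_PARENTS = {
    "CWE-89": "CWE-943",   # SQL Injection -> Improper Neutralization in Data Query Logic
    "CWE-943": "CWE-74",   # -> Injection
    "CWE-79": "CWE-74",    # Cross-site Scripting -> Injection
}

def ancestors(cwe: str) -> list[str]:
    """Walk up the (toy) hierarchy from a CWE to its root."""
    chain = []
    while cwe in CWE_PARENTS:
        cwe = CWE_PARENTS[cwe]
        chain.append(cwe)
    return chain

def alpha_penalty(predicted: str, gold: str) -> float:
    """Smaller penalties for hierarchy-consistent mistakes, the largest for lateral ones."""
    if predicted == gold:
        return 0.0
    if predicted in ancestors(gold):   # over-generalisation: predicted an ancestor of the gold CWE
        return 0.25
    if gold in ancestors(predicted):   # over-specification: predicted a descendant of the gold CWE
        return 0.5
    return 1.0                         # lateral error: unrelated CWE

print(alpha_penalty("CWE-74", "CWE-89"))  # 0.25 -- over-generalisation
print(alpha_penalty("CWE-79", "CWE-89"))  # 1.0  -- lateral (sibling) error
```

The design intent such a scheme captures is that an overly broad but hierarchy-consistent prediction still carries diagnostic value for downstream correction, whereas an unrelated CWE does not.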