RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their ability to produce secure code remains a critical and under-explored question. Existing benchmarks often fall short: they rely on synthetic vulnerabilities or evaluate functional correctness in isolation, and so fail to capture the interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic static application security testing (SAST) with CodeQL, LLM-based false-positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and covering a wide range of data-flow complexity, including vulnerabilities with up to 34-hop inter-procedural dependencies. Using RealSec-bench, we conduct an extensive empirical study of 5 popular LLMs. We introduce a novel composite metric, SecurePass@K, that assesses functional correctness and security simultaneously. We find that while Retrieval-Augmented Generation (RAG) techniques can improve functional correctness, they provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs.
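
The abstract does not give a formula for SecurePass@K. As a rough illustration only, the sketch below assumes it follows the widely used unbiased pass@k estimator, with a generated sample counted as successful only when it both passes the functional tests and is free of the target vulnerability. The function names and the `(functional, secure)` tuple layout are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for one task
    c: number of those samples counted as successful
    k: evaluation budget, with 1 <= k <= n
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def secure_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Assumed composite metric: a sample succeeds only if it is functional AND secure.

    Each sample is a (functional, secure) pair. This layout is an assumption
    made for illustration, not the benchmark's actual data format.
    """
    n = len(samples)
    c = sum(1 for functional, secure in samples if functional and secure)
    return pass_at_k(n, c, k)


# Four hypothetical generations for a single benchmark instance.
samples = [(True, True), (True, False), (False, True), (True, True)]
functional_only = pass_at_k(len(samples), sum(f for f, _ in samples), 1)
composite = secure_pass_at_k(samples, 1)
print(functional_only, composite)
```

Under this reading, the composite score can never exceed the functional-only score for the same samples, which is one way to see how a technique could raise functional correctness while leaving the joint measure nearly unchanged.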
