Large language models (LLMs) and LLM-based coding agents are now widely used to generate code from natural-language specifications, yet ensuring that such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating the correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks, each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness alone substantially overestimates how reliably models generate code: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors (scale, extended thinking, quantization, instruction tuning, and code specialization) do not reliably improve joint performance, suggesting that secure-and-correct code generation does not simply emerge from stronger coding capability. An evaluation of three leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct LLM generation on specification-only tasks. A qualitative audit reveals that failures concentrate at the output contract boundary and in guards that exist but are insufficient, patterns that only joint benchmarking reliably exposes.
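To make the gap between functional and joint success concrete, the following minimal sketch computes both rates from per-task outcomes. It assumes that a task counts as a joint success only when the generated solution passes both its functional tests and its security tests; the abstract does not spell out the exact metric, so this definition, the `TaskResult` type, the `pass_rates` helper, and the four example outcomes are all hypothetical illustrations rather than DualGauge's actual implementation or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one generated solution (hypothetical record layout)."""
    task_id: str
    functional_pass: bool  # all functional tests passed
    security_pass: bool    # all security tests passed


def pass_rates(results):
    """Return functional, security, and joint pass rates as fractions.

    Assumption: joint success means passing both test suites for a task.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to aggregate")
    functional = sum(r.functional_pass for r in results) / n
    secure = sum(r.security_pass for r in results) / n
    joint = sum(r.functional_pass and r.security_pass for r in results) / n
    return {"functional": functional, "secure": secure, "joint": joint}


# Made-up outcomes for illustration only, not benchmark results.
example = [
    TaskResult("t1", True, True),
    TaskResult("t2", True, False),
    TaskResult("t3", True, False),
    TaskResult("t4", False, True),
]
rates = pass_rates(example)
print(rates)
```

In this toy input, three of four tasks pass functionally but only one passes both suites, so a functional-only score would rank the model far higher than the joint score does.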