AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct it with a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of the original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models and three agent scaffoldings on MT-Sec and observe a consistent 20-27% drop in "correct and secure" outputs from single-turn to multi-turn settings, even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation, an unexplored yet practically relevant setting, and find that models perform worse there, with higher rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, their gains diminish in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
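To make the pipeline concrete, the following is a minimal Python sketch of the idea described above: a single-turn task's specification is decomposed into incremental user turns while the original test suite is reused unchanged to score the final program. All names here (`SingleTurnTask`, `MultiTurnTask`, `decompose_spec`, the `llm` callable) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the single-turn -> multi-turn transformation.
# Class and function names are illustrative, not the authors' pipeline API.
from dataclasses import dataclass


@dataclass
class SingleTurnTask:
    task_id: str
    spec: str               # full natural-language specification
    test_suite: list[str]   # executable tests for correctness and security


@dataclass
class MultiTurnTask:
    task_id: str
    turns: list[str]        # incremental instructions, one per turn
    test_suite: list[str]   # reused unchanged from the single-turn task


def decompose_spec(spec: str, llm) -> list[str]:
    """Ask a generator model to split one spec into semantically aligned,
    incremental instructions (e.g., core feature -> edge cases -> hardening).
    `llm` is any callable mapping a prompt string to a response string."""
    prompt = (
        "Rewrite the following coding task as 3-5 incremental user turns.\n"
        "Together, the turns must preserve the original requirements so the\n"
        "original tests still apply to the final program.\n\n" + spec
    )
    return [t.strip() for t in llm(prompt).split("\n") if t.strip()]


def to_multi_turn(task: SingleTurnTask, llm) -> MultiTurnTask:
    # Only the final program, produced after the last turn, is judged,
    # which is why the original test suite can be reused verbatim.
    return MultiTurnTask(task.task_id, decompose_spec(task.spec, llm),
                         task.test_suite)
```

Reusing the original test suite is the key design choice in this sketch: it means the multi-turn variant measures the same correctness and security criteria as the single-turn task, so any performance drop can be attributed to the interaction structure rather than to a change in the evaluation target.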