Modern software development demands code that is maintainable, testable, and scalable, which requires organizing implementations into modular components and iteratively reusing existing code. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow, i.e., to implement new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of more than 5,000 competitive programming problems from Codeforces kept up to date via an automated pipeline, and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. We further introduce a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios, and our in-depth analysis shows that model performance correlates inversely with dependency complexity. These findings not only highlight critical challenges in supporting real-world workflows but also establish CodeFlowBench as an essential tool for advancing code generation research.