Modern software development demands code that is maintainable, testable, and scalable, which requires organizing implementations into modular components and iteratively reusing existing code. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow, i.e., to implement new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of more than 5,000 competitive programming problems from Codeforces kept up to date via an automated pipeline, and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. We further introduce a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios, and our in-depth analysis shows that model performance correlates inversely with dependency complexity. These findings not only highlight critical challenges in supporting real-world workflows but also establish CodeFlowBench as an essential tool for advancing code generation research.