FixEval: Execution-based Evaluation of Program Fixes for Programming Problems


March 30, 2023


Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, Chris Brown

The complexity of modern software has led to a drastic increase in the time and cost associated with detecting and rectifying software bugs. In response, researchers have explored various methods to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for any given bug, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes and to report further information on time and memory constraints and on acceptance based on a verdict. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the quality of model-generated program fixes. Execution-based methods, by contrast, evaluate programs on all test cases and scenarios designed explicitly for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
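The contrast between match-based and execution-based evaluation can be illustrated with a minimal sketch. This is not FixEval's actual harness: the function names (`exact_match`, `run_verdict`), the verdict strings, the toy "add two integers" problem, and the 5-second default time limit are all assumptions chosen for illustration, and memory usage is not measured here.

```python
import subprocess
import sys


def exact_match(candidate: str, reference: str) -> bool:
    """Match-based metric (illustrative): equality of whitespace-split tokens."""
    return candidate.split() == reference.split()


def run_verdict(source: str, tests, time_limit: float = 5.0) -> str:
    """Execution-based metric (illustrative): run the program on every test case.

    `tests` is a list of (stdin_text, expected_stdout) pairs. The verdict
    names are hypothetical, loosely modeled on online-judge conventions.
    """
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded"
        if proc.returncode != 0:
            return "Runtime Error"
        # Compare outputs token by token so trailing whitespace is ignored.
        if proc.stdout.split() != expected.split():
            return "Wrong Answer"
    return "Accepted"


# Toy problem (assumed): read two integers and print their sum.
reference_fix = "a, b = map(int, input().split())\nprint(a + b)\n"
candidate_fix = "x, y = [int(t) for t in input().split()]\nprint(x + y)\n"
buggy_submission = "a, b = map(int, input().split())\nprint(a - b)\n"
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n")]

if __name__ == "__main__":
    print("exact match:", exact_match(candidate_fix, reference_fix))
    print("candidate verdict:", run_verdict(candidate_fix, tests))
    print("buggy verdict:", run_verdict(buggy_submission, tests))
```

In this toy case the candidate fix differs textually from the reference, so the match-based check rejects it, while running it against the test cases accepts it. That is the kind of disagreement between the two metric families that the abstract describes.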


