LLM-powered coding agents are redefining how real-world software is developed. To drive research toward better coding agents, we need challenging benchmarks that can rigorously evaluate the ability of such agents to perform a variety of software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers must handle a much broader set of tasks in real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages (Python, Java, and C++) and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage; to this end, we present a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as test generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 20.9% with DeepSeek-V3.1 on Java test generation tasks. OmniCode aims to serve as a robust benchmark and to spur the development of agents that perform well across different aspects of software development. Code and data are available at https://github.com/seal-research/OmniCode.