Large Language Models (LLMs) have demonstrated remarkable capabilities in software engineering (SE), yet comprehensive benchmarks covering diverse SE activities remain limited. We present a multi-task evaluation of 11 state-of-the-art LLMs across five representative software engineering tasks: bug fixing, feature development, code refactoring, technical copywriting, and research synthesis. Our automated verification framework measures both output quality and completion efficiency. Key findings reveal that (1) models achieving identical perfect scores exhibit 22x variation in completion time, 49x variation in tool efficiency, and 53x variation in estimated cost; (2) tool-usage frequency shows no correlation with success (r = 0.077, p = 0.575): one model used 917 tool calls, while another solved the same task with 3; (3) two distinct inefficiency patterns emerge, loop inefficiency and inference inefficiency; and (4) coding tasks achieve a 100% success rate, while research tasks prove more challenging (90.9%). We release all experimental data, verification scripts, and analysis code for full reproducibility.