Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. BEHELM provides a structured way to assess models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.