Large language models (LLMs) are increasingly deployed in real-world software engineering, fostering the development of code evaluation metrics that study the quality of LLM-generated code. Conventional rule-based metrics merely score programs by their surface-level similarity to reference programs instead of analyzing functionality and code quality in depth. To address this limitation, researchers have developed LLM-as-a-judge metrics, which prompt LLMs to evaluate and score code, and have curated various code evaluation benchmarks to validate their effectiveness. However, these benchmarks suffer from critical limitations that hinder reliable assessment of evaluation capability: some feature coarse-grained binary labels, which reduce rich code behavior to a single bit of information and obscure subtle errors; others propose fine-grained but subjective, vaguely defined evaluation criteria, introducing unreliability into the manually annotated scores that serve as their ground truth. Furthermore, they often use uncontrolled data synthesis methods, leading to unbalanced score distributions that poorly represent real-world code generation scenarios. To curate a diverse benchmark whose programs are well balanced across quality levels, and to streamline the manual annotation procedure, we propose AXIOM, a novel perturbation-based framework for synthesizing code evaluation benchmarks at scale. AXIOM reframes a program's score as the refinement effort needed before deployment and consists of two stages: (1) Rule-guided perturbation, which prompts LLMs to apply sequences of predefined perturbation rules to existing high-quality programs, modifying their functionality and code quality; this lets us precisely control each program's target score and achieve balanced score distributions. (2) Multisource quality calibration, which first selects a subset of...
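To make the rule-guided perturbation stage concrete, the minimal sketch below shows one way a target score drop could be turned into a rule sequence and an LLM prompt. The rule names, their score penalties, and the helpers `plan_perturbations` and `build_perturbation_prompt` are illustrative assumptions, not the actual AXIOM rule set or implementation.

```python
import random

# Hypothetical perturbation rules and the score penalty each is assumed to carry;
# the real AXIOM rule set and scoring scheme are defined in the paper, not here.
PERTURBATION_RULES = {
    "introduce_off_by_one_error": 3,  # functional bug
    "remove_input_validation":    2,  # robustness issue
    "inline_magic_numbers":       1,  # code-quality issue
    "duplicate_helper_logic":     1,  # maintainability issue
}

def plan_perturbations(target_drop: int, rng: random.Random) -> list[str]:
    """Greedily pick a rule sequence whose summed penalty matches the desired
    score drop, so the perturbed program lands near its target score."""
    remaining, plan = target_drop, []
    rules = list(PERTURBATION_RULES.items())
    while remaining > 0:
        candidates = [(rule, penalty) for rule, penalty in rules if penalty <= remaining]
        if not candidates:
            break
        rule, penalty = rng.choice(candidates)
        plan.append(rule)
        remaining -= penalty
    return plan

def build_perturbation_prompt(program: str, plan: list[str]) -> str:
    """Compose an LLM prompt asking for the planned rules to be applied in order."""
    steps = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(plan))
    return (
        "Apply the following perturbations to the program, in order, "
        "changing nothing else:\n" + steps + "\n\nProgram:\n" + program
    )

if __name__ == "__main__":
    rng = random.Random(0)
    original = "def mean(xs):\n    return sum(xs) / len(xs)\n"
    plan = plan_perturbations(target_drop=4, rng=rng)  # aim for a mid-quality target score
    print(build_perturbation_prompt(original, plan))
```

Because the planned penalties sum to the intended score drop, sampling many such plans across different target drops is one way to obtain the balanced score distribution described above.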