Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1,364 rigorously curated code preference tasks covering three verifiable properties (correctness, efficiency, and security) along with human preference. Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%. Meanwhile, CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective. We also rigorously validate the design choices in CodeFavor via a comprehensive set of controlled experiments. Furthermore, we discover the prohibitive costs and limitations of human-based code preference: despite spending 23.4 person-minutes on each task, 15.1-40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.