Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires assessing LLMs' outputs in depth. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences. Based on this approach, we present CodeUltraFeedback, a comprehensive dataset designed to facilitate the evaluation and improvement of LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are ranked against five distinct coding preferences using GPT-3.5 as a judge, which provides both numerical scores and detailed textual feedback. Our analysis of CodeUltraFeedback reveals that responses from GPT-3.5 and GPT-4 are generally preferred over those from open-weight LLMs, highlighting significant differences in alignment between closed and open-weight models. We then explore using CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct via supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned CodeLlama-7B-Instruct model outperforms larger LLMs in alignment with coding preferences and shows improved functional correctness on the HumanEval+ benchmark compared to the original instruct model. Our contributions thus bridge the gap in preference tuning of LLMs for code and set the stage for further advances in model alignment and RLAIF in automated software engineering.
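To make the DPO step concrete, the sketch below shows the standard pairwise DPO objective in plain Python: the policy is pushed to assign a larger (reference-relative) log-probability margin to the judge-preferred response than to the rejected one. This is a minimal illustration of the published DPO loss, not the paper's training code; the function name, the scalar log-probability inputs, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Pairwise DPO loss for one preference pair (illustrative sketch).

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the policy being tuned (pi_*) and under the frozen
    reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the reference, minus the same quantity for the
    # rejected response, scaled by the temperature beta.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # separates chosen from rejected, large when it prefers the wrong one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At a zero margin the loss is exactly ln(2) ≈ 0.693; it decreases as the
# policy's preference for the chosen response grows.
```

In practice the same loss is computed batch-wise over token-level log-probabilities, with the reference model kept frozen so that `beta` controls how far the tuned policy may drift from it.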