CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is challenging because it requires assessing the intricate textual outputs of LLMs. Existing benchmarks rely on automated metrics and static analysis tools and therefore fail to assess nuances in user instructions and LLM outputs, which highlights the need for large-scale datasets and benchmarks for LLM preference alignment. In this paper, we introduce CodeUltraFeedback, a preference dataset of 10,000 complex instructions for tuning and aligning LLMs to coding preferences through AI feedback. We generate responses to the instructions using a pool of 14 diverse LLMs and annotate each response according to its alignment with five coding preferences, using the LLM-as-a-Judge approach with GPT-3.5 to produce both numerical and textual feedback. We also present CODAL-Bench, a benchmark for assessing LLM alignment with these coding preferences. Our results show that CodeLlama-7B-Instruct, aligned through reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO) using CodeUltraFeedback's AI feedback data, outperforms 34B LLMs on CODAL-Bench, validating the utility of CodeUltraFeedback for preference tuning. Furthermore, we show that our DPO-aligned CodeLlama model improves functional correctness on HumanEval+ compared to the unaligned base model. Our contributions therefore bridge the gap in preference tuning of LLMs for code and set the stage for further advancements in model alignment and RLAIF for code intelligence. Our code and data are available at https://github.com/martin-wey/CodeUltraFeedback.
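To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how judge ratings could be turned into preference pairs and scored with the standard DPO objective. It is not the authors' implementation: the `JudgedResponse` record, its field names, the "highest-rated versus lowest-rated" pairing rule, and the default `beta` value are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical record for one annotated response (field names are assumed)."""
    model: str     # which LLM in the pool produced the response
    text: str      # the response itself
    rating: float  # numerical score assigned by the judge LLM


def build_preference_pair(instruction, responses):
    """Return a (prompt, chosen, rejected) dict, or None if no preference exists.

    Assumed pairing rule: the highest-rated response is "chosen" and the
    lowest-rated one is "rejected". Other selection strategies are possible.
    """
    if len(responses) < 2:
        return None
    ranked = sorted(responses, key=lambda r: r.rating, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen.rating == rejected.rating:
        return None  # all ratings tie, so there is no preference signal
    return {"prompt": instruction, "chosen": chosen.text, "rejected": rejected.text}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable evaluation of log(1 + exp(-x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    pool = [
        JudgedResponse("model-a", "def add(a, b): return a + b", 4.0),
        JudgedResponse("model-b", "def add(a, b): return a - b", 1.0),
        JudgedResponse("model-c", "add = lambda a, b: a + b", 3.0),
    ]
    print(build_preference_pair("Write an add function.", pool))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen response than the reference model does, which is the signal DPO optimizes. In practice the log-probabilities would come from the models themselves rather than being supplied by hand.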


