Evaluating how well large language models (LLMs) align with user-defined coding preferences is challenging because it requires assessing intricate textual outputs. Existing benchmarks rely on automated metrics and static analysis tools and therefore fail to capture nuances in user instructions and LLM outputs, highlighting the need for large-scale datasets and benchmarks for LLM preference alignment. In this paper, we introduce CodeUltraFeedback, a preference dataset of 10,000 complex instructions for tuning and aligning LLMs to coding preferences through AI feedback. We generate responses to the instructions using a pool of 14 diverse LLMs and annotate each response according to its alignment with five coding preferences using the LLM-as-a-Judge approach with GPT-3.5, producing both numerical and textual feedback. We also present CODAL-Bench, a benchmark for assessing LLM alignment with these coding preferences. Our results show that CodeLlama-7B-Instruct, aligned through reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO) using CodeUltraFeedback's AI feedback data, outperforms 34B LLMs on CODAL-Bench, validating the utility of CodeUltraFeedback for preference tuning. Furthermore, we show that our DPO-aligned CodeLlama model improves functional correctness on HumanEval+ compared to the unaligned base model. Our contributions thus bridge the gap in preference tuning of LLMs for code and set the stage for further advances in model alignment and RLAIF for code intelligence. Our code and data are available at https://github.com/martin-wey/CodeUltraFeedback.
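To make the path from judge ratings to DPO training data concrete, the following is a minimal sketch of how numerically rated responses to one instruction could be turned into preference pairs. It is an illustration under stated assumptions, not the procedure used for CodeUltraFeedback: the names `RatedResponse` and `build_preference_pairs`, the `min_gap` threshold, the example ratings, and the prompt/chosen/rejected record layout are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RatedResponse:
    """One LLM response plus the judge's numerical rating (hypothetical schema)."""
    model: str
    text: str
    rating: float  # assumed: higher means better aligned with the preference


def build_preference_pairs(instruction, responses, min_gap=1.0):
    """Pair up responses whose ratings differ by at least `min_gap`.

    Each pair becomes a record with the higher-rated response as "chosen"
    and the lower-rated one as "rejected". Near-ties are skipped because
    they carry little preference signal. `min_gap` is an assumed knob.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        if abs(a.rating - b.rating) < min_gap:
            continue  # ratings too close to express a clear preference
        chosen, rejected = (a, b) if a.rating > b.rating else (b, a)
        pairs.append(
            {
                "prompt": instruction,
                "chosen": chosen.text,
                "rejected": rejected.text,
            }
        )
    return pairs


# Illustrative data only; these are not results from the dataset.
example_responses = [
    RatedResponse("model-a", "def add(a, b):\n    return a + b", 5.0),
    RatedResponse("model-b", "add = lambda a, b: a + b", 3.0),
    RatedResponse("model-c", "def add(a,b): return a+b", 3.0),
]
example_pairs = build_preference_pairs(
    "Write a readable function that adds two numbers.", example_responses
)
```

With the illustrative ratings above, the two lower-rated responses tie and are not paired with each other, so only the pairs involving the top-rated response are kept. Other pairing strategies (for example, pairing only the best response against a randomly sampled weaker one) are equally plausible.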