Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and improving mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model that better aligns the smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to accurately compute the executability rate of the generated code. Our approach significantly enhances the fine-tuned LLM baseline's performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.
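To make the executability-rate metric concrete, below is a minimal sketch of how such a check could be implemented. This is an illustrative stand-in, not the paper's actual pipeline (which targets Gorilla-style API calls and their dependencies): each generated snippet is executed in a subprocess, and the metric is the fraction that exit cleanly. The function name `executability_rate` and the timeout handling are assumptions for illustration.

```python
import subprocess
import sys
import tempfile

def executability_rate(snippets, timeout=10):
    """Fraction of generated code snippets that run without error.

    Simplified illustration: each snippet is written to a temp file
    and executed in a subprocess; it counts as executable only if it
    exits with return code 0 before the timeout.
    """
    ok = 0
    for code in snippets:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            if result.returncode == 0:
                ok += 1
        except subprocess.TimeoutExpired:
            pass  # hung snippets count as non-executable
    return ok / len(snippets) if snippets else 0.0

# One valid snippet and one that raises ImportError -> rate of 0.5
print(executability_rate(["print('hello')", "import nonexistent_module_xyz"]))
```

A real pipeline for API-call generation would additionally need to install or mock the referenced libraries, since an otherwise correct call fails to execute when its import is unavailable.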