Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans evaluate model-written code more accurately. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might otherwise have avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.