Reinforcement Learning with Human Feedback (RLHF) has been shown to significantly improve the performance of large language models (LLMs) by aligning their outputs with desired human values. However, RLHF is constrained by the expertise and productivity limitations of human evaluators. In this study, we investigate an alternative to RLHF: Reinforcement Learning with Generative Adversarial Feedback (RLGAF). Our preliminary findings indicate that RLGAF can help align LLM outputs without suffering from the inherent restrictions of RLHF, suggesting promising avenues for further research on automating AI alignment.