Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints. However, in such cases it is often easy to check whether these constraints are satisfied or violated. Recent works have shown that LLMs can benefit from such "corrective feedback". We argue that this ability of LLMs can be significantly enhanced via training. We introduce an RL framework for teaching models to use such feedback: we simulate interaction sessions and reward the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks using unlabeled training data. We find that CORGI consistently outperforms a baseline reinforcement learning method that does not incorporate conversational feedback. Furthermore, CORGI's interactive framework enables meta-learning, allowing the LLM to generalize better to guided interaction on new tasks. Our results clearly show that conversational optimization, when combined with reinforcement learning, significantly improves the effectiveness of LLMs in controlled generation contexts.
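To make the interaction loop concrete, the following is a minimal Python sketch of how a simulated session could assign an episode reward based on constraint satisfaction. The `generate` and `check_constraint` names, the toy word-count constraint, and the binary reward are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): one simulated guided-interaction episode.
# `generate` stands in for the LLM and `check_constraint` for the automatic
# constraint verifier; the episode reward is 1.0 when the constraint is met.

def check_constraint(text: str, max_words: int = 10) -> tuple[bool, str]:
    """Toy verifier: the generation must contain at most `max_words` words."""
    n = len(text.split())
    if n <= max_words:
        return True, "Constraint satisfied."
    return False, f"Too long: {n} words, limit is {max_words}. Please shorten it."

def simulate_episode(generate, prompt: str, max_turns: int = 3) -> float:
    """Run one interaction session and return the RL reward for the episode."""
    history = [prompt]
    for _ in range(max_turns):
        response = generate("\n".join(history))    # model proposes a generation
        ok, feedback = check_constraint(response)  # verifier checks the constraint
        if ok:
            return 1.0                             # reward: constraint satisfied
        history += [response, feedback]            # feed corrective feedback back in
    return 0.0                                     # reward: constraint never satisfied

# Usage with a trivial stand-in "model" that ignores the feedback:
if __name__ == "__main__":
    dummy_model = lambda prompt: "a short reply"
    print(simulate_episode(dummy_model, "Write a sentence of at most 10 words."))
```

In an RL setting, the returned reward would then be used to update the generating model, e.g. via a policy-gradient objective, so that it learns both to satisfy constraints on the first attempt and to exploit the verifier's corrective feedback on later turns.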