LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
