Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent actions enable greater semantic diversity in text generation. On downstream tasks, CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA halves computation time in tasks where RL is used to enhance LLMs' thinking prompts. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications.