Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference to self-improve on a given task. After each response, the model receives scalar numerical feedback, which we refer to as a reward. In the next round, we prompt the LLM again with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
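To make the multi-round loop concrete, the following is a minimal sketch of ICRL prompting as described above. The callables `llm_generate` and `reward_fn`, the round budget, and the prompt wording are illustrative assumptions rather than the paper's exact implementation; the reward function may be an external evaluator or the same LLM acting as a judge.

```python
def icrl_prompting(task_prompt, llm_generate, reward_fn, num_rounds=10):
    """Minimal sketch of ICRL prompting (assumed interface, not the paper's code).

    Each round conditions the LLM on all prior (response, reward) pairs and
    asks it to produce a response that earns a higher reward.
    """
    history = []  # list of (response, reward) pairs accumulated across rounds
    best_response, best_reward = None, float("-inf")

    for _ in range(num_rounds):
        # Build the context: the task plus every prior response and its scalar reward.
        context = task_prompt
        for i, (resp, rew) in enumerate(history, start=1):
            context += f"\n\n--- Attempt {i} ---\n{resp}\nReward: {rew}"
        context += "\n\nProduce a new response that achieves a higher reward."

        response = llm_generate(context)  # query the LLM with the growing context
        reward = reward_fn(response)      # scalar feedback for this response
        history.append((response, reward))

        if reward > best_reward:
            best_response, best_reward = response, reward

    return best_response, best_reward
```

In this sketch, the context grows by one (response, reward) pair per round, mirroring the abstract's claim that response quality improves as the context accumulates feedback.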