Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks for a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
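To make the per-turn selection idea concrete, here is a minimal sketch of best-of-N rollout construction on a toy one-dimensional environment. All names (`sample_action`, `step`, `score`) and the environment itself are illustrative stand-ins, not the paper's API or benchmarks: at each turn the agent samples N candidate actions and keeps the one whose resulting state receives the best task-specific feedback.

```python
import random

# Toy stand-ins for the policy and environment; names are illustrative,
# not from the TSR paper. The state is the agent's distance to a goal.
def sample_action(state, rng):
    # Pretend policy: propose a step toward (-1) or away from (+1) the goal.
    return rng.choice([-1, +1])

def step(state, action):
    # Deterministic toy environment: distance never drops below zero.
    return max(0, state + action)

def score(state):
    # Task-specific feedback: states closer to the goal score higher.
    return -state

def best_of_n_rollout(init_state, horizon, n, seed=0):
    """Per-turn best-of-N selection: at each turn, sample n candidate
    actions and keep the one whose resulting state scores highest."""
    rng = random.Random(seed)
    state, trajectory = init_state, []
    for _ in range(horizon):
        candidates = [sample_action(state, rng) for _ in range(n)]
        # Select the candidate with the best one-step feedback.
        action = max(candidates, key=lambda a: score(step(state, a)))
        state = step(state, action)
        trajectory.append((action, state))
    return trajectory

traj = best_of_n_rollout(init_state=5, horizon=5, n=4)
print(traj[-1][1])  # final distance to goal
```

Beam or shallow lookahead variants would keep several partial trajectories (or score actions by simulating a few steps ahead) instead of committing to a single action per turn; the trajectories selected this way are then fed to the optimizer (e.g. PPO or GRPO) unchanged.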