Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult: current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions all subsequent actions on that strategy, and jointly trains strategy generation and action execution with a hierarchical GRPO-style rollout design. Two additional components, diverse strategy rollout and critical self-judgment, further enhance training. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop, and on SciWorld it attains a 63.5% overall score, outperforming frontier closed-source models.
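To make the hierarchical GRPO-style rollout concrete, the following is a minimal sketch of one plausible way to compute group-relative advantages at two levels. The abstract does not specify the exact formulas, so everything here is an assumption for illustration: the function names `group_relative` and `hierarchical_advantages`, the use of the mean rollout return as a strategy's score, and the toy reward values are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """GRPO-style normalization within one group: (x - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def hierarchical_advantages(rewards):
    """Compute two levels of group-relative advantages.

    rewards[i][j] is the return of rollout j executed under strategy i.
    Assumption: a strategy is scored by the mean return of its rollouts.
    """
    # Strategy level: compare strategies sampled for the same task.
    strategy_scores = [mean(group) for group in rewards]
    strategy_adv = group_relative(strategy_scores)
    # Action level: compare rollouts that share the same strategy.
    action_adv = [group_relative(group) for group in rewards]
    return strategy_adv, action_adv


# Toy example (made-up returns): 2 strategies, 3 rollouts each.
rewards = [
    [1.0, 0.0, 1.0],  # strategy 0 succeeds in two of three rollouts
    [0.0, 0.0, 0.0],  # strategy 1 never succeeds
]
strategy_adv, action_adv = hierarchical_advantages(rewards)
```

Under these assumptions, the strategy that leads to more successful rollouts receives a positive strategy-level advantage, while rollouts are credited only relative to others that followed the same strategy. A group whose rollouts all receive the same return yields zero action-level advantage.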