Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computation indiscriminately across intermediate steps. Such approaches inherently waste substantial computational budget on trivial steps while failing to guarantee sample quality. To address this, we propose \textbf{Spark} (\textbf{S}trategic \textbf{P}olicy-\textbf{A}ware explo\textbf{R}ation via \textbf{K}ey-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand its exploration and generalize more strongly. Experiments across diverse tasks (e.g., embodied planning) demonstrate that \textsc{Spark} achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.