Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computation indiscriminately across intermediate steps. Such approaches inherently waste substantial computational budget on trivial steps while failing to guarantee sample quality. To address this, we propose \textbf{Spark} (\textbf{S}trategic \textbf{P}olicy-\textbf{A}ware explo\textbf{R}ation via \textbf{K}ey-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand its exploration and generalize more strongly. Experiments across diverse tasks (e.g., embodied planning) demonstrate that \textsc{Spark} achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.