Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
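The core mechanism described above — autoregressively filling a chain of latent subgoal tokens before emitting the primitive action, with an MLP-Mixer providing cross-token communication — can be illustrated with a minimal NumPy sketch. Everything here (dimensions, weight shapes, the names `mixer_block` and `coghp_forward`) is hypothetical and uses random weights for illustration; it is not the authors' implementation, which would learn these parameters from offline data.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # token embedding dimension (assumed)
K = 3           # number of latent subgoals in the chain (assumed)
T = 2 + K + 1   # token sequence: [state, goal, subgoal_1..K, action]

# Hypothetical, randomly initialized Mixer weights (illustration only).
W_tok = rng.normal(0, 0.1, (T, T))   # token-mixing: cross-token communication
W_ch  = rng.normal(0, 0.1, (D, D))   # channel-mixing: per-token MLP
W_act = rng.normal(0, 0.1, (D, 4))   # readout from the action token

def mixer_block(x):
    """One MLP-Mixer block: mix across tokens, then across channels."""
    x = x + np.tanh(W_tok @ x)       # each channel attends over all tokens
    x = x + np.tanh(x @ W_ch)        # per-token feedforward over channels
    return x

def coghp_forward(state_emb, goal_emb):
    """Autoregressively fill K latent subgoal tokens, then read the action."""
    x = np.zeros((T, D))
    x[0], x[1] = state_emb, goal_emb
    for i in range(K):
        h = mixer_block(x)
        x[2 + i] = h[2 + i]          # commit subgoal i; it conditions later steps
    h = mixer_block(x)
    return h[-1] @ W_act             # primitive action from the final token

action = coghp_forward(rng.normal(size=D), rng.normal(size=D))
print(action.shape)  # (4,)
```

The sketch captures the unified-architecture idea: a single network processes state, goal, subgoals, and action as one token sequence, and each committed subgoal token acts as an intermediate "reasoning step" that all subsequent predictions can read through the token-mixing layer.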