Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

Hamilton-Jacobi-Bellman (HJB) theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state. Yet standard online goal-conditioned RL (GCRL) still conditions the actor on the raw goal, a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ$-space, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ of the displacement from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and can be modified independently behind the shared $(d_t, r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning inputs at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks. A qualitative analysis of the learned $ψ$-distance landscape shows that the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.
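
As a rough illustration of the $(d_t, r_t)$ interface described above, the following NumPy sketch computes the unit direction and magnitude between two embeddings and picks a subgoal from a set of candidate embeddings. The abstract does not specify the scoring rule, so the cosine-alignment score used here is an assumption made purely for illustration, as are the function names `direction_and_magnitude`, `score_subgoals`, and `select_subgoal`, the `eps` parameter, and the treatment of embeddings as plain arrays rather than outputs of a learned $ψ$.

```python
import numpy as np


def direction_and_magnitude(psi_s, psi_target, eps=1e-8):
    """Return (d, r): unit direction and norm of psi_target - psi_s.

    If the two embeddings coincide, d is the zero vector and r is 0.
    """
    delta = psi_target - psi_s
    r = float(np.linalg.norm(delta))
    d = delta / max(r, eps)  # eps guards against division by zero
    return d, r


def score_subgoals(psi_s, psi_candidates, psi_g, eps=1e-8):
    """Hypothetical scoring rule (an assumption, not the paper's rule).

    Scores each candidate by the cosine between (candidate - s) and
    (g - s) in embedding space. psi_candidates has shape (n, dim).
    """
    to_goal = psi_g - psi_s
    to_cand = psi_candidates - psi_s
    num = to_cand @ to_goal
    den = np.linalg.norm(to_cand, axis=1) * np.linalg.norm(to_goal) + eps
    return num / den


def select_subgoal(psi_s, psi_candidates, psi_g):
    """Index of the candidate best aligned with the goal direction."""
    return int(np.argmax(score_subgoals(psi_s, psi_candidates, psi_g)))


# Toy 2-D embeddings standing in for psi(s_t), psi(g), and visited states.
psi_s = np.array([0.0, 0.0])
psi_g = np.array([4.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

idx = select_subgoal(psi_s, candidates, psi_g)

# Training-time conditioning uses the selected subgoal z_t.
d_train, r_train = direction_and_magnitude(psi_s, candidates[idx])
# Deployment-time conditioning uses the goal g in place of z_t.
d_deploy, r_deploy = direction_and_magnitude(psi_s, psi_g)
```

In this toy setup the candidate lying on the straight line toward the goal is selected, so the training-time and deployment-time directions agree. The example only illustrates the shape of the interface; it makes no claim about the paper's actual scoring rule, representation, or bound.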
