Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents because of temporal credit assignment and error accumulation, and the offline setting amplifies both effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning method for offline GCRL. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs a fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference time. On the OGBench benchmark, TTGS improves the success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.
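As a rough illustration of the mechanism described above, the snippet below builds a weighted graph over a set of dataset states using a distance proxy derived from a goal-conditioned value function, then runs Dijkstra's algorithm to recover a subgoal chain. This is a minimal sketch under stated assumptions, not the paper's implementation: the helper `value_fn`, the k-nearest-neighbor connectivity, the `-V(s, g)` distance convention, and all parameter values are placeholders introduced here for illustration.

```python
# Minimal sketch of a value-guided test-time graph search (illustrative only).
# Assumptions not in the source: `value_fn(s, g)` returns a learned
# goal-conditioned value V(s, g); distances use d(s, g) = -V(s, g), which
# matches a -1-per-step reward convention; k-NN connectivity and all
# thresholds are hypothetical choices.
import heapq
import numpy as np

def build_graph(states, value_fn, k=10):
    """Connect each dataset state to its k 'closest' neighbors under the
    value-derived distance proxy, returning an adjacency list of
    (neighbor_index, edge_weight) pairs."""
    n = len(states)
    edges = [[] for _ in range(n)]
    for i in range(n):
        d = np.array([-value_fn(states[i], states[j]) for j in range(n)])
        for j in np.argsort(d)[1:k + 1]:  # skip the state itself
            edges[i].append((int(j), float(d[j])))
    return edges

def shortest_subgoal_path(edges, start, goal):
    """Dijkstra over the state graph; returns a list of subgoal indices
    from `start` (exclusive) to `goal` (inclusive), or [] if unreachable."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal != start and goal not in prev:
        return []
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return path[::-1]
```

At rollout time, the frozen goal-conditioned policy would be conditioned on the first subgoal in the returned chain and advanced to the next subgoal once the same distance proxy indicates the current one has been reached; how switching is triggered is an implementation choice, not something this sketch pins down.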