Diffusion-based trajectory planners can synthesize rich, multimodal action sequences for offline reinforcement learning, but their iterative denoising incurs substantial inference-time cost, making closed-loop planning slow under tight compute budgets. We study the problem of achieving diffusion-like trajectory planning behavior with one-step inference, while retaining the ability to sample diverse candidate plans and condition on the current state in a receding-horizon control loop. Our key observation is that conditional trajectory generation fails under naïve distribution-matching objectives when the similarity measure used to align generated trajectories with the dataset is dominated by unconstrained future dimensions. In practice, this causes attraction toward average trajectories, collapses action diversity, and yields near-static behavior. Our key insight is that conditional generative planning requires a conditioning-aware notion of neighborhood: trajectory updates should be computed using distances in a compact key space that reflects the condition, while still applying updates in the full trajectory space. Building on this, we introduce Keyed Drifting Policies (KDP), a one-step trajectory generator trained with a drift-field objective that attracts generated trajectories toward condition-matched dataset windows and repels them from nearby generated samples, using a stop-gradient drifted target to amortize iterative refinement into training. At inference, the resulting policy produces a full trajectory window in a single forward pass. Across standard RL benchmarks and real-time hardware deployments, KDP achieves strong performance with one-step inference and substantially lower planning latency than diffusion sampling. Project website, code and videos: https://keyed-drifting.github.io/
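The drift-field objective described above can be illustrated with a minimal numpy sketch. This is a hypothetical illustration under stated assumptions, not the paper's actual implementation: the function name `drift_targets`, the choice of the leading `key_dim` dimensions as the compact key space, and the hyperparameters `eta` and `lam` are all assumptions introduced here. It shows the core idea: distances for matching are computed in the key space, while the attraction/repulsion updates are applied in the full trajectory space, producing a drifted target that would serve as a stop-gradient regression target for the one-step generator.

```python
import numpy as np

def drift_targets(gen, data, key_dim, eta=0.5, lam=0.1):
    """Sketch of conditioning-aware drifted targets.

    gen:     (B, D) generated trajectory windows
    data:    (N, D) dataset trajectory windows
    key_dim: number of leading dimensions forming the compact key space
             (assumed here to be the condition-reflecting part of the window)
    """
    gk, dk = gen[:, :key_dim], data[:, :key_dim]

    # Attraction: nearest dataset window, with the neighborhood measured
    # in KEY space only (so unconstrained future dims do not dominate).
    d_att = ((gk[:, None, :] - dk[None, :, :]) ** 2).sum(-1)   # (B, N)
    nn = data[d_att.argmin(axis=1)]                            # (B, D)
    attract = nn - gen                                         # full-space update

    # Repulsion: push each sample away from its nearest OTHER generated
    # sample (again matched in key space) to preserve plan diversity.
    d_rep = ((gk[:, None, :] - gk[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d_rep, np.inf)
    peer = gen[d_rep.argmin(axis=1)]
    repel = gen - peer

    # Drifted target; during training this would be held constant
    # (stop-gradient) and regressed onto by the one-step generator,
    # amortizing iterative refinement into training.
    return gen + eta * (attract + lam * repel)
```

With `eta=1.0` and `lam=0.0` the target reduces to the key-space nearest dataset window, which makes the attraction term easy to check in isolation; a nonzero `lam` then trades some of that attraction for separation between generated samples.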