Hierarchical policies decompose language-conditioned, long-horizon robotic manipulation into a high-level planner and a low-level controller. Effective coordination between the two, however, requires that both operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. ORCHID filters policy samples using environment feedback, identifies the trajectories on which the planner and controller are jointly successful, and distills them back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in what the controller can actually reach, while the controller specializes in the trajectory structures the planner produces. Because it relies on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online gradient-based RL training of hierarchical diffusion policies. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform purely offline methods, including a Vision-Language-Action model twice its size.
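The filter-then-distill loop described above can be illustrated with a deliberately small toy. This is a sketch under strong assumptions, not ORCHID's implementation: a 1-D Gaussian stands in for the diffusion planner, a noisy linear map stands in for the controller, and every name here (`ToyPlanner`, `ToyController`, `env_success`, `collect`, `self_train`) and every constant is hypothetical. It shows only the general pattern: sample on-policy, keep the rollouts the environment marks as successful, and refit both modules on them by supervised learning.

```python
import random


class ToyPlanner:
    """Gaussian over 1-D subgoals (hypothetical stand-in for a diffusion planner)."""

    def __init__(self, mean, std=1.0):
        self.mean, self.std = mean, std

    def sample(self, rng):
        return rng.gauss(self.mean, self.std)

    def fit(self, subgoals):
        # Supervised update: maximum-likelihood mean of the kept subgoals.
        self.mean = sum(subgoals) / len(subgoals)


class ToyController:
    """Noisy linear map subgoal -> action (hypothetical stand-in for a low-level policy)."""

    def __init__(self, gain, noise=0.3):
        self.gain, self.noise = gain, noise

    def act(self, subgoal, rng):
        return self.gain * subgoal + rng.gauss(0.0, self.noise)

    def fit(self, pairs):
        # Supervised update: least-squares gain through the origin.
        den = sum(g * g for g, _ in pairs)
        if den > 0:
            self.gain = sum(g * a for g, a in pairs) / den


def env_success(action, target=2.0, tol=0.5):
    """Binary environment feedback (assumed): did the action land near the target?"""
    return abs(action - target) < tol


def collect(planner, controller, n, rng):
    """Roll out the current hierarchy; return kept (subgoal, action) pairs and success rate."""
    kept = []
    for _ in range(n):
        g = planner.sample(rng)
        a = controller.act(g, rng)
        if env_success(a):  # filter: keep only jointly successful rollouts
            kept.append((g, a))
    return kept, len(kept) / n


def self_train(planner, controller, rounds=3, n=2000, seed=0):
    """Run filtered self-training; return the success rate before each round and after the last."""
    rng = random.Random(seed)
    rates = []
    for _ in range(rounds):
        kept, rate = collect(planner, controller, n, rng)
        rates.append(rate)
        if kept:
            # Distill the same filtered data into both modules.
            planner.fit([g for g, _ in kept])
            controller.fit(kept)
    _, final_rate = collect(planner, controller, n, rng)
    rates.append(final_rate)
    return rates
```

Calling `self_train(ToyPlanner(mean=1.0), ToyController(gain=1.0))` returns one success rate per round plus a final evaluation. In this toy, the planner's mean drifts toward subgoals the controller can turn into successful actions, and the controller is refit on the subgoals the planner actually emits. No gradients flow through the environment, which is the property the abstract credits for stability.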