Diffusion policies have advanced visuomotor control by progressively denoising high-dimensional action sequences, offering a promising approach to robot manipulation. However, as task complexity increases, the success rates of existing baselines drop considerably. Our analysis shows that current diffusion policies face two limitations. First, they condition only on short-term observations. Second, their training objective is limited to a single denoising loss, which leads to error accumulation and causes grasping deviations. To address these limitations, this paper proposes Foresight-Conditioned Diffusion (ForeDiffusion), which injects a predicted future-view representation into the diffusion process. The resulting policy is guided to be forward-looking, enabling it to correct trajectory deviations. Building on this design, ForeDiffusion employs a dual-loss mechanism that combines the conventional denoising loss with a consistency loss on future observations to achieve unified optimization. Extensive evaluation on the Adroit suite and the MetaWorld benchmark demonstrates that ForeDiffusion achieves an average success rate of 80% across all tasks, significantly outperforming existing mainstream diffusion methods by 23% on complex tasks while maintaining more stable performance overall.
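The dual-loss objective described above can be sketched as a weighted sum of a standard denoising loss and a future-observation consistency loss. This is a minimal illustrative sketch, not the paper's implementation: the function name `dual_loss`, the MSE form of the consistency term, and the weight `lam` are assumptions.

```python
import numpy as np

def dual_loss(eps_pred, eps_true, future_pred, future_obs, lam=0.5):
    """Hypothetical sketch of a dual-loss objective in the spirit of
    ForeDiffusion: the usual diffusion denoising MSE plus a consistency
    MSE between the predicted future-view representation and the encoded
    future observation. The weighting `lam` is illustrative."""
    denoise = np.mean((eps_pred - eps_true) ** 2)           # standard denoising loss
    consistency = np.mean((future_pred - future_obs) ** 2)  # future-view consistency loss
    return denoise + lam * consistency

# Toy usage with random vectors standing in for noise predictions
# and future-view representations.
rng = np.random.default_rng(0)
eps_pred, eps_true = rng.normal(size=8), rng.normal(size=8)
fut_pred, fut_obs = rng.normal(size=16), rng.normal(size=16)
print(dual_loss(eps_pred, eps_true, fut_pred, fut_obs))
```

In practice both terms would be computed on network outputs inside the training loop and backpropagated jointly, so the policy is optimized for accurate denoising and foresight at the same time.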