Diffusion policies are a powerful paradigm for robotic control, but fine-tuning them with human preferences is fundamentally challenged by the multi-step structure of the denoising process. To overcome this, we introduce a Unified Markov Decision Process (MDP) formulation that coherently integrates the diffusion denoising chain with environmental dynamics, enabling reward-free Direct Preference Optimization (DPO) for diffusion policies. Building on this formulation, we propose RoDiF (Robust Direct Fine-Tuning), a method that explicitly addresses corrupted human preferences. RoDiF reinterprets the DPO objective through a geometric hypothesis-cutting perspective and employs a conservative cutting strategy to achieve robustness without assuming any specific noise distribution. Extensive experiments on long-horizon manipulation tasks show that RoDiF consistently outperforms state-of-the-art baselines, effectively steering pretrained diffusion policies of diverse architectures to human-preferred modes, while maintaining strong performance even under 30% corrupted preference labels.
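To make the preference-optimization setting concrete, the sketch below shows a standard per-pair DPO loss together with a generic margin-clipping variant that bounds the influence of any single (possibly mislabeled) preference pair. This is only an illustration of the robustness problem the abstract describes: RoDiF's actual conservative hypothesis-cutting rule is not specified here, and the function names, the clipping heuristic, and the threshold `tau` are all assumptions, not the paper's method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair (w = preferred, l = dispreferred).

    The implicit-reward margin compares the policy's log-probability gain
    over a frozen reference policy on the two trajectories.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def clipped_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                     beta=0.1, tau=2.0):
    """Hypothetical robust variant (NOT RoDiF's cutting rule): clip the
    margin to [-tau, tau] so one corrupted label cannot dominate training."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin = max(-tau, min(tau, margin))
    return -math.log(sigmoid(margin))

# A zero margin gives loss log 2; a pair that strongly contradicts the
# current policy (a likely corrupted label) yields an unbounded standard
# loss but a bounded clipped loss.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))       # 0.6931
print(dpo_loss(-100.0, 100.0, 0.0, 0.0) > 19.0)     # True
print(clipped_dpo_loss(-100.0, 100.0, 0.0, 0.0) < 2.5)  # True
```

In a diffusion-policy setting, the per-trajectory log-probabilities above would be accumulated over the unified MDP's steps (denoising steps plus environment transitions) rather than computed in closed form.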