Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control

We hypothesize that a key bottleneck in generalizable robot manipulation is not solely data scale or policy capacity, but a structural mismatch between current visual backbones and the physical requirements of closed-loop control. State-of-the-art vision encoders (including those used in VLAs) optimize for semantic invariance to stabilize classification, whereas manipulation typically demands geometric sensitivity: the ability to map millimeter-level pose shifts to predictable feature changes. This discriminative objective creates a "blind spot" for fine-grained control. Generative diffusion models, by contrast, inherently encode geometric dependencies within their latent manifolds, which encourages the preservation of dense multi-scale spatial structure. However, directly deploying diffusion features for control is hindered by stochastic instability, inference latency, and representation drift during fine-tuning. To bridge this gap, we propose Robot-DIFT, a framework that decouples the source of geometric information from the process of inference via Manifold Distillation. By distilling a frozen diffusion teacher into a deterministic Spatial-Semantic Feature Pyramid Network (S2-FPN), we retain the rich geometric priors of the generative model while ensuring temporal stability, real-time execution, and robustness against drift. Pretrained on the large-scale DROID dataset, Robot-DIFT demonstrates superior geometric consistency and control performance compared to leading discriminative baselines, supporting the view that how a model learns to see dictates how well it can learn to act.
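
The abstract does not state the distillation objective, so the following is only a minimal sketch of what matching a deterministic student pyramid to frozen teacher features could look like. It assumes a per-location cosine distance averaged within each pyramid level and summed across levels; that loss choice, the function names (`cosine_distance`, `level_loss`, `distillation_loss`), and the nested-list feature layout are all hypothetical illustrations, not the paper's method.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cos(u, v) for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def level_loss(student_map, teacher_map):
    """Mean per-location cosine distance between two H x W x C feature maps.

    Each map is a nested list indexed as map[row][col][channel].
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(student_map, teacher_map):
        for s_vec, t_vec in zip(s_row, t_row):
            total += cosine_distance(s_vec, t_vec)
            count += 1
    return total / count


def distillation_loss(student_pyramid, teacher_pyramid, weights=None):
    """Weighted sum of per-level losses over a multi-scale feature pyramid.

    Assumption: the teacher features were extracted once from a frozen model,
    so only the student would receive gradients in a real training loop.
    """
    if weights is None:
        weights = [1.0] * len(student_pyramid)
    return sum(
        w * level_loss(s, t)
        for w, s, t in zip(weights, student_pyramid, teacher_pyramid)
    )


# Toy pyramid with a single level of shape 1 x 2 x 2 (H x W x C).
teacher = [[[[1.0, 0.0], [0.0, 1.0]]]]
aligned_student = [[[[2.0, 0.0], [0.0, 3.0]]]]      # same directions as teacher
orthogonal_student = [[[[0.0, 1.0], [1.0, 0.0]]]]   # orthogonal at every location

loss_aligned = distillation_loss(aligned_student, teacher)
loss_orthogonal = distillation_loss(orthogonal_student, teacher)
```

In this toy setup the aligned student incurs a loss near zero and the orthogonal one a loss of one per level. A cosine-based loss ignores feature magnitude, which is one plausible design choice when teacher and student activations live on different scales; an L2 loss would be an equally reasonable alternative.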
