Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich manipulation requires correspondence sensitivity: discriminative feature responses to action-relevant changes in pose, boundary, and contact geometry. Diffusion features provide a strong prior for dense correspondence, but direct use is impractical due to stochasticity, latency, and representation drift. We introduce Robot-DIFT, a deterministic diffusion-derived backbone for real-time control. Through Manifold Distillation, Robot-DIFT converts a noise-conditioned diffusion Teacher into a clean-input, single-pass Student while preserving the Teacher's feature manifold. A Spatial-Semantic Feature Pyramid Network (S2-FPN) fuses coarse-to-fine Student decoder features into visual tokens that expose semantic context and fine contact detail to the policy. Across RoboCasa, LIBERO-10, and real robots, Robot-DIFT outperforms vision-language, self-supervised, geometry-oriented, and diffusion baselines on contact-sensitive tasks. Controlled backbone/readout swaps show that S2-FPN unlocks, rather than replaces, the diffusion correspondence prior.
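To make the readout idea concrete, the following is a minimal sketch of generic top-down feature-pyramid fusion that turns coarse-to-fine feature maps into a flat set of visual tokens. It is not the S2-FPN described above, whose internals the abstract does not specify; it only illustrates the standard FPN pattern (lateral projection, upsample, add) that such a module plausibly builds on. The function names, channel widths, token dimension, nearest-neighbour upsampling, and random projection weights are all assumptions made for illustration.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def lateral(x, w):
    """1x1 convolution as a per-pixel linear map: (C_in, H, W) -> (D, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)


def fpn_tokens(features, weights):
    """Fuse feature maps top-down and flatten the result into tokens.

    features: list of (C_i, H_i, W_i) arrays, coarsest first; each level is
              assumed to have twice the spatial resolution of the previous.
    weights:  list of (D, C_i) lateral projection matrices (hypothetical;
              random here, learned in a real model).
    Returns an (H * W, D) token array at the finest resolution.
    """
    fused = lateral(features[0], weights[0])
    for feat, w in zip(features[1:], weights[1:]):
        # Carry coarse (semantic) context upward and add fine (spatial) detail.
        fused = upsample2x(fused) + lateral(feat, w)
    d, h, width = fused.shape
    return fused.reshape(d, h * width).T


# Illustrative usage with made-up shapes: three decoder levels, 8-dim tokens.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((c, s, s)) for c, s in [(64, 4), (32, 8), (16, 16)]]
ws = [rng.standard_normal((8, c)) * 0.1 for c in (64, 32, 16)]
tokens = fpn_tokens(feats, ws)
print(tokens.shape)  # (256, 8)
```

Each output token corresponds to one location on the finest grid and sums the projected features of every pyramid level covering that location, which is one simple way a policy could receive both semantic context and fine spatial detail in the same token.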