Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs typically rely on either slow multi-step video generation or noisy one-step feature extraction, and thus cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations in a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses the structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos serve as teacher targets, and lightweight decouplers, acting as students, learn to map noisy one-step features directly to these targets. Extensive experiments in simulation and the real world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
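To make the self-distillation idea concrete, below is a minimal sketch of the training objective in PyTorch. It is not the paper's implementation: the module names (`video_diffusion`, `vfm_encoder`, `Decoupler`) and the `multi_step_denoise` / `one_step_features` interfaces are hypothetical stand-ins, and the actual architecture, feature shapes, and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Decoupler(nn.Module):
    """Lightweight student head that maps noisy one-step diffusion
    features to clean VFM (teacher) representations.
    Hypothetical sketch; the paper's decouplers may differ."""

    def __init__(self, feat_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, target_dim),
            nn.GELU(),
            nn.Linear(target_dim, target_dim),
        )

    def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(noisy_feats)


def self_distillation_loss(video_diffusion, vfm_encoder, decoupler, obs):
    """One training step of the self-distillation objective, assuming
    `video_diffusion` exposes a multi-step sampler and a one-step
    feature extractor (both assumed interfaces)."""
    # Teacher targets: VFM representations of the diffusion model's own
    # multi-step generated video; no gradients flow through the teacher.
    with torch.no_grad():
        gen_video = video_diffusion.multi_step_denoise(obs, steps=50)
        teacher = vfm_encoder(gen_video)                 # (B, T, D_target)

    # Student inputs: noisy features from a single denoising pass.
    noisy_feats = video_diffusion.one_step_features(obs)  # (B, T, D_feat)
    student = decoupler(noisy_feats)

    # Regress the one-step student features onto the teacher targets,
    # condensing multi-step generative priors into one-step inference.
    return F.mse_loss(student, teacher)
```

At deployment only the one-step path is used: the decoupler's output serves as the foreseen representation that conditions action prediction, so the slow multi-step sampler is needed only during training.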