We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, maintaining object permanence 33% better and complying with the laws of motion 29% better in autoregressive rollouts. Moreover, experiments show that this grounding in contact dynamics also translates to improved planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM demonstrates strong downstream versatility, adapting its learned contact dynamics to a novel task and achieving reliable planning success from only a limited set of demonstrations.