Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations raise the average planning success rate to 32.1%, compared with 20.7% for wrist-view observations and 22.0% for front-view observations. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than on modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where prediction errors compound and contact uncertainty accumulates over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.