Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than exploiting them explicitly to model contact dynamics or to close the control loop. In this paper, we present \textbf{OmniViTac}, a large-scale visuo-tactile-action dataset comprising $21{,}000+$ trajectories across $86$ tasks and $100+$ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose \textbf{OmniVTA}, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model that predicts short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60\,Hz reflexive controller that corrects deviations between predicted and observed tactile signals in closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
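To make the closed-loop mechanism concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of one 60\,Hz reflexive control cycle: the policy's nominal action is corrected in proportion to the deviation between the world model's predicted tactile signal and the observed one. All function names, the linear tactile-to-action mapping, and the gain are illustrative assumptions.

\begin{verbatim}
# Hypothetical sketch of a 60 Hz tactile reflex loop; placeholder stand-ins
# for the tactile world model, tactile sensor, and fusion policy.
import time
import numpy as np

GAIN = 0.05          # hypothetical proportional gain on tactile error
RATE_HZ = 60.0       # reflex loop frequency stated in the abstract
DT = 1.0 / RATE_HZ

def predict_tactile(state: np.ndarray) -> np.ndarray:
    """Stand-in for the world model's short-horizon tactile prediction."""
    return np.zeros(16)

def read_tactile_sensor() -> np.ndarray:
    """Stand-in for the observed tactile signal."""
    return np.zeros(16)

def policy_action(state: np.ndarray) -> np.ndarray:
    """Stand-in for the contact-aware fusion policy's nominal action."""
    return np.zeros(6)

def reflex_step(state: np.ndarray, jacobian: np.ndarray) -> np.ndarray:
    """One cycle: nudge the nominal action by the prediction deviation."""
    tactile_error = read_tactile_sensor() - predict_tactile(state)
    # Map the tactile-space error into action space through an assumed
    # linear sensitivity model; the paper's actual mapping may differ.
    correction = -GAIN * jacobian @ tactile_error
    return policy_action(state) + correction

if __name__ == "__main__":
    state = np.zeros(6)
    J = np.zeros((6, 16))   # placeholder tactile-to-action sensitivity
    t_next = time.monotonic()
    for _ in range(10):     # run a few 60 Hz cycles
        action = reflex_step(state, J)
        t_next += DT
        time.sleep(max(0.0, t_next - time.monotonic()))
\end{verbatim}

The fixed-period sleep keeps the loop at roughly 60\,Hz regardless of per-cycle compute time; a real controller would additionally saturate the correction to keep reflexive adjustments within safe action bounds.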