World action models inherit the predictive capability of world models, so that action generation can be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where the critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion, which selectively integrates tactile signals, and (ii) a contact-aware attention bias, which regulates cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy: during training, we reformulate the contact-aware bias so that the fused attention path is preserved, and at inference, we introduce cache-based diffusion acceleration. Together these achieve up to 2.9$\times$ faster training and 1.8$\times$ faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7\% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.
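To make the two contact-aware components more concrete, the following is a minimal sketch of one plausible reading of them, not the Dream-Tac implementation. The abstract does not specify how the gate or the bias is computed, so everything here is an assumption made for illustration: the function names, the sigmoid gate driven by a scalar force reading, the `threshold` and `sharpness` parameters, and the choice of `log(gate + eps)` as an additive bias on tactile keys.

```python
import math


def contact_gate(force, threshold=0.5, sharpness=10.0):
    """Soft contact indicator in (0, 1).

    Assumption: contact is inferred from a scalar force reading; the
    threshold and sharpness values are arbitrary illustrative choices.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (force - threshold)))


def gated_fusion(visual, tactile, gate):
    """Add the tactile feature to the visual one, scaled by the gate.

    With gate near 0 (no contact) the fused feature is close to the
    visual feature alone; with gate near 1 the tactile feature is
    added at nearly full strength.
    """
    return [v + gate * t for v, t in zip(visual, tactile)]


def biased_attention(scores, is_tactile, gate, eps=1e-6):
    """Softmax over key logits with an additive bias on tactile keys.

    Assumption: the bias is log(gate + eps), so tactile keys are
    suppressed when there is no contact and left almost unchanged
    when the gate is close to 1.
    """
    bias = math.log(gate + eps)
    logits = [s + bias if tac else s for s, tac in zip(scores, is_tactile)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for force in (0.0, 1.0):
        g = contact_gate(force)
        weights = biased_attention([0.0, 0.0], [False, True], g)
        print(f"force={force:.1f} gate={g:.4f} attention={weights}")
```

Because the bias is added to the logits before the softmax, it can be folded into a single attention call instead of requiring a separate masked pass. This is one way an additive bias can stay compatible with a fused attention kernel, though whether Dream-Tac's reformulation works this way is not stated in the abstract.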