Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs, coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we fine-tune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a large-scale hybrid dataset drawn from both a high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, which highlights the importance of understanding physical contact for robust, touch-aware robotic agents.
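The two-stage recipe described above (policy training with an alignment term, then fine-tuning with future-tactile prediction) can be illustrated with a minimal sketch. The abstract does not specify the form of the HSA loss or of the world-model objective, so everything below is an assumption made for illustration: the cosine-distance alignment, the mean-squared-error prediction loss, the one-to-one pairing of tactile tokens with view tokens, and the names `alignment_loss`, `future_tactile_loss`, `stage1_loss`, `stage2_loss`, `align_weight`, and `wm_weight` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each token (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(tactile_tokens, view_tokens):
    """Mean cosine distance between paired tokens.

    Hypothetical stand-in for an HSA-style term: row i of `tactile_tokens`
    is assumed to be already paired with row i of `view_tokens`.
    """
    t = l2_normalize(tactile_tokens)
    v = l2_normalize(view_tokens)
    return float(np.mean(1.0 - np.sum(t * v, axis=-1)))


def future_tactile_loss(predicted, target):
    """Mean squared error between predicted and observed future tactile frames."""
    return float(np.mean((predicted - target) ** 2))


def stage1_loss(action_loss, tactile, wrist, third_person, align_weight=0.1):
    """Stage 1 (assumed form): action loss plus alignment to both camera views."""
    hsa = alignment_loss(tactile, wrist) + alignment_loss(tactile, third_person)
    return action_loss + align_weight * hsa


def stage2_loss(action_loss, predicted_tactile, future_tactile, wm_weight=1.0):
    """Stage 2 (assumed form): action loss plus future-tactile prediction error."""
    return action_loss + wm_weight * future_tactile_loss(predicted_tactile, future_tactile)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tactile = rng.normal(size=(16, 32))       # 16 tactile tokens, 32-dim each
    wrist = rng.normal(size=(16, 32))         # paired wrist-view tokens
    third_person = rng.normal(size=(16, 32))  # paired third-person tokens
    print("stage 1:", stage1_loss(0.5, tactile, wrist, third_person))

    predicted = rng.normal(size=(4, 8, 8))    # 4 predicted future tactile frames
    observed = rng.normal(size=(4, 8, 8))     # the frames actually observed
    print("stage 2:", stage2_loss(0.5, predicted, observed))
```

In this sketch the alignment term vanishes when tactile tokens point in the same direction as their paired view tokens, and the prediction term vanishes when the imagined tactile frames match the observed ones. The relative weights are free parameters here, not values reported for DreamTacVLA.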
