Vision-Language-Action (VLA) models have recently emerged as powerful generalists for robotic manipulation. However, owing to their predominant reliance on visual modalities, they fundamentally lack the physical intuition needed for contact-rich tasks that demand precise force regulation and physical reasoning. Existing attempts to incorporate vision-based tactile sensing into VLA models typically treat tactile inputs as auxiliary visual textures, thereby overlooking the underlying correlation between surface deformation and interaction dynamics. To bridge this gap, we propose a paradigm shift from tactile-vision alignment to tactile-force alignment. Specifically, we introduce TaF-VLA, a framework that explicitly grounds high-dimensional tactile observations in physical interaction forces. To facilitate this, we develop an automated tactile-force data acquisition device and curate the TaF-Dataset, comprising over 10 million synchronized tactile observations, 6-axis force/torque measurements, and matrix force maps. To align sequential tactile observations with interaction forces, the central component of our approach is the Tactile-Force Adapter (TaF-Adapter), a tactile encoder that extracts discretized latent representations from tactile observations. This mechanism ensures that the learned representations capture history-dependent, noise-insensitive physical dynamics rather than static visual textures. Finally, we integrate this force-aligned encoder into a VLA backbone. Extensive real-world experiments demonstrate that the TaF-VLA policy significantly outperforms state-of-the-art tactile-vision-aligned and vision-only baselines on contact-rich tasks, verifying its ability to achieve robust, force-aware manipulation through cross-modal physical reasoning.
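The adapter described above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature dimensions, the vector-quantization-style codebook, the mean-pooled history window, and the linear force head are all illustrative assumptions, shown only to make "discretized latents aligned to 6-axis force/torque" concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): per-frame tactile feature
# size, discrete codebook size, and the 6-axis force/torque output.
FEAT_DIM, CODEBOOK_SIZE, FORCE_DIM = 64, 128, 6

# In a trained model these would be learned; random here for illustration.
codebook = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))
force_head = rng.normal(size=(FEAT_DIM, FORCE_DIM))

def quantize(features):
    """Map continuous tactile features to their nearest codebook entries
    (the 'discretized latent' step of a VQ-style adapter)."""
    # features: (T, FEAT_DIM) sequence of per-frame tactile embeddings.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)           # (T,) discrete code indices
    return codebook[idx], idx

def predict_force(features):
    """Force-alignment head: regress a 6-axis force/torque vector from
    the quantized tactile latents, pooled over the history window."""
    z_q, idx = quantize(features)
    pooled = z_q.mean(axis=0)        # history-dependent (sequence-level) pooling
    return pooled @ force_head, idx

tactile_seq = rng.normal(size=(8, FEAT_DIM))   # 8-frame tactile history
force, codes = predict_force(tactile_seq)
print(force.shape, codes.shape)                # (6,) (8,)
```

Training such an adapter with a force-regression loss, rather than a visual reconstruction loss, is what would push the discrete codes toward interaction dynamics instead of static surface texture; the discretization additionally makes the representation insensitive to small visual perturbations of the tactile image.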