Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
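The snippet below is an illustrative sketch, not the paper's implementation, of the discretized numerical tokenization idea mentioned above: continuous contact metrics (e.g., force magnitude and principal-axis orientation) are quantized into a small vocabulary of discrete tokens appended to the tactile caption, so the text encoder receives explicit physical quantities. The bin counts, value ranges, and token formats (`<FORCE_k>`, `<AXIS_k>`) are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of discretized numerical tokenization, assuming uniform
# binning. Bin counts, value ranges, and token names are illustrative and
# are NOT the configuration used in FG-CLTP.

def make_bins(lo: float, hi: float, n_bins: int) -> np.ndarray:
    """Return bin edges splitting [lo, hi] into n_bins uniform intervals."""
    return np.linspace(lo, hi, n_bins + 1)

def value_to_token(value: float, edges: np.ndarray, prefix: str) -> str:
    """Map a continuous value to a discrete token such as '<FORCE_11>'."""
    # np.digitize returns 1..n_bins for in-range values; clamp out-of-range.
    idx = int(np.clip(np.digitize(value, edges) - 1, 0, len(edges) - 2))
    return f"<{prefix}_{idx}>"

# Hypothetical ranges: force 0-20 N (32 bins), principal-axis angle 0-180 deg (18 bins).
force_edges = make_bins(0.0, 20.0, 32)
angle_edges = make_bins(0.0, 180.0, 18)

def tokenize_contact_state(force_n: float, angle_deg: float, caption: str) -> str:
    """Append quantized numeric tokens to a tactile caption so quantitative
    contact states are aligned with the language description."""
    tokens = [
        value_to_token(force_n, force_edges, "FORCE"),
        value_to_token(angle_deg, angle_edges, "AXIS"),
    ]
    return caption + " " + " ".join(tokens)

print(tokenize_contact_state(7.3, 95.0, "elliptical contact on a rigid edge"))
# -> "elliptical contact on a rigid edge <FORCE_11> <AXIS_9>"
```

In this framing, the resulting token sequence can be fed to an ordinary text encoder during contrastive pretraining, letting quantitative contact states participate in the tactile-language alignment without any architectural change to the encoder.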