Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they must be retrained for each new sensor because of differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to new sensor data. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors for a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT's potential to make tactile sensing more scalable and practical for real-world robotic applications.
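To make the idea of global self-attention over a tactile image concrete, the following is a minimal NumPy sketch of the generic Vision Transformer mechanism, not TacViT's actual architecture. The image size (32×32), patch size (8), feature dimension (16), the single attention head, and the random projection weights are all assumptions chosen for illustration; the helper names `patchify` and `self_attention` are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into flattened, non-overlapping patch tokens."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch)
    blocks = image.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32))         # stand-in for a grayscale tactile image (assumed size)
tokens = patchify(image, patch=8)    # 16 patch tokens, 64 pixels each
dim = 16                             # assumed feature dimension
# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(tokens.shape[1], dim)) for _ in range(3))
features, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` spans all 16 patches, so every output feature can draw on the whole image in a single layer, whereas a convolutional layer only sees a local neighborhood. A full model would add position embeddings, multiple heads, stacked layers, and a regression or classification head for the contact properties.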