ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and in generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added to the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate the effect of injecting positional encodings before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
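
To make the two injection points concrete, the following is a minimal NumPy sketch of the idea as described in the abstract, not the ViTaPEs implementation. Everything beyond the ordering of the steps is an assumption made for illustration: the function and parameter names (`two_stage_encode`, `pe_vis`, `pe_tac`, `pe_global`), the use of `tanh` as the token-wise nonlinearity, the single-head attention, and the randomly initialised weights standing in for learned ones.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention (assumed for brevity).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(n_vis, n_tac, dim, seed=0):
    # Hypothetical parameters; random values stand in for learned ones.
    rng = np.random.default_rng(seed)

    def mat(*shape):
        return rng.normal(scale=0.1, size=shape)

    return {
        "pe_vis": mat(n_vis, dim),             # local PE, vision stream
        "pe_tac": mat(n_tac, dim),             # local PE, tactile stream
        "pe_global": mat(n_vis + n_tac, dim),  # global PE, joint sequence
        "w_vis": mat(dim, dim),
        "w_tac": mat(dim, dim),
        "w_q": mat(dim, dim),
        "w_k": mat(dim, dim),
        "w_v": mat(dim, dim),
    }


def two_stage_encode(vis_tokens, tac_tokens, p):
    # Stage 1: add modality-specific PEs inside each stream, before a
    # token-wise nonlinearity (tanh here is an assumed stand-in).
    vis = np.tanh((vis_tokens + p["pe_vis"]) @ p["w_vis"])
    tac = np.tanh((tac_tokens + p["pe_tac"]) @ p["w_tac"])

    # Stage 2: concatenate both streams and add one global PE to the
    # joint sequence immediately before self-attention.
    joint = np.concatenate([vis, tac], axis=0) + p["pe_global"]
    return self_attention(joint, p["w_q"], p["w_k"], p["w_v"])


# Usage: 4 vision tokens and 3 tactile tokens of width 8.
rng = np.random.default_rng(1)
params = init_params(n_vis=4, n_tac=3, dim=8)
out = two_stage_encode(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)), params)
print(out.shape)  # (7, 8)
```

In this sketch the local encodings pass through the nonlinearity, while the global encoding reaches the attention projections directly, which is the distinction the ablations on injection points are meant to isolate.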
