In visually ambiguous manipulation tasks, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100 Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with O(1) inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training; and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves a 100% success rate, significantly outperforming the visual-only pi_0.5 baseline while strictly satisfying hard real-time constraints.
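To make the O(1)-latency claim concrete, the sketch below shows the general mechanism behind a Mamba-style history compressor: a linear state-space recurrence whose per-sample cost is constant and independent of history length, folding a high-frequency force stream into a fixed-size state that a low-frequency policy can consume. All names, dimensions, and parameter values here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

STATE_DIM = 16  # size of the compact tactile state (illustrative assumption)

class TactileHistoryCompressor:
    """Minimal O(1)-per-step state-space recurrence (Mamba-style inference)."""

    def __init__(self, input_dim=3, state_dim=STATE_DIM, seed=0):
        rng = np.random.default_rng(seed)
        # Diagonal decay with |a| < 1 keeps the recurrence stable while
        # retaining a long (exponentially decaying) memory of past forces.
        self.A = np.full(state_dim, 0.95)
        self.B = rng.normal(scale=0.1, size=(state_dim, input_dim))
        self.h = np.zeros(state_dim)

    def step(self, force_sample):
        # One update per tactile sample: h_t = A * h_{t-1} + B @ u_t.
        # Cost does not grow with the length of the interaction history,
        # unlike Transformer attention over the full sequence.
        self.h = self.A * self.h + self.B @ np.asarray(force_sample, dtype=float)
        return self.h  # compact state handed to the low-frequency visual policy

comp = TactileHistoryCompressor()
for _ in range(1000):  # e.g. 1 s of data from a hypothetical 1 kHz tactile loop
    state = comp.step([0.1, 0.0, 0.2])
```

Because only `h` is carried between steps, the tactile loop can run at high rate while the visual policy reads the latest `state` at its own, much lower, frequency.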