The ability to react dynamically to tactile signals has long been considered crucial to agile, human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality or rely on encoders that capture only static cues. Three factors contribute to this: the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and the limitations of static tactile encoders. In this paper, we push the frontier of tactile-reactive manipulation by addressing all three. We introduce a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To exploit inherently high-frequency touch signals without sacrificing the capabilities of existing VLAs, we propose a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable object manipulation, achieving an average success rate more than 30% higher than the strongest baseline.