Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.
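The mapping performed by the FDM can be illustrated with a minimal sketch. The abstract does not specify the module's internals, so everything below is an assumption made for illustration: the single-head cross-attention, the mean-squared-error alignment objective, the class name `ForceDistillationSketch`, and all dimensions and random weights are hypothetical and are not the authors' implementation. The sketch only shows the data flow described above: a learnable query token attends over visual tokens and a robot-state token, and the result is compared against a latent that stands in for an encoded real force signal.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ForceDistillationSketch:
    """Hypothetical stand-in for an FDM: one query token, one attention head."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # The query token would be a learnable parameter; here it is only initialized.
        self.query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.W_q = random_matrix(dim, dim, rng)
        self.W_k = random_matrix(dim, dim, rng)
        self.W_v = random_matrix(dim, dim, rng)
        self.W_o = random_matrix(dim, dim, rng)

    def attention_weights(self, context_tokens):
        # Scaled dot-product scores between the query and each context token.
        q = matvec(self.W_q, self.query)
        keys = [matvec(self.W_k, t) for t in context_tokens]
        scores = [dot(q, k) / math.sqrt(self.dim) for k in keys]
        return softmax(scores)

    def predict_force_token(self, visual_tokens, state_token):
        # Condition the query on visual observations and the robot state.
        context = visual_tokens + [state_token]
        weights = self.attention_weights(context)
        values = [matvec(self.W_v, t) for t in context]
        pooled = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(self.dim)
        ]
        return matvec(self.W_o, pooled)


def alignment_loss(predicted, target):
    # Assumed objective: mean squared error between the predicted force token
    # and the latent of the measured force signal.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Toy inputs: random vectors stand in for real encoder outputs.
rng = random.Random(1)
dim = 8
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]
state_token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
force_latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # placeholder target

fdm = ForceDistillationSketch(dim)
force_token = fdm.predict_force_token(visual_tokens, state_token)
loss = alignment_loss(force_token, force_latent)
print(f"alignment loss on toy data: {loss:.4f}")
```

In this sketch the force latent is needed only to compute the loss. `predict_force_token` itself takes just visual tokens and the robot state, which mirrors the claim that no physical force sensor is required at inference.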