Vision-Language-Action (VLA) models have shown strong capability in enabling robots to follow general instructions, yet they struggle with contact-rich manipulation tasks, where success requires precise alignment, stable contact maintenance, and effective handling of deformable objects. A fundamental challenge arises from the imbalance between high-entropy vision and language inputs and low-entropy but critical force signals, which often leads to over-reliance on perception and unstable control. To address this, we introduce CRAFT, a force-aware curriculum fine-tuning framework that integrates a variational information bottleneck module to regulate vision and language embeddings during early training. This curriculum strategy encourages the model to prioritize force signals initially, then progressively restores access to the full multimodal input. To enable force-aware learning, we further design a homologous leader-follower teleoperation system that collects synchronized vision, language, and force data across diverse contact-rich tasks. Real-world experiments demonstrate that CRAFT consistently improves task success rates, generalizes to unseen objects and novel task variations, and adapts effectively across diverse VLA architectures, enabling robust and generalizable contact-rich manipulation.
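The core mechanism can be illustrated with a minimal sketch: a variational information bottleneck compresses an embedding through a stochastic code with a KL penalty, and a curriculum weight on that penalty starts high (heavily compressing vision/language so the policy must rely on force signals) and anneals to zero (restoring full multimodal access). This is an assumed, simplified illustration, not the paper's implementation; the function names, the linear annealing schedule, and the diagonal-Gaussian parameterization are all hypothetical choices for exposition.

```python
import numpy as np

def vib_compress(x, W_mu, W_logvar, rng):
    """Variational information bottleneck (hypothetical sketch):
    map embedding x to a stochastic code z via the reparameterization
    trick, and return the KL penalty to a standard normal prior."""
    mu = x @ W_mu
    logvar = x @ W_logvar
    # Reparameterized sample: z = mu + sigma * eps
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # KL( N(mu, sigma^2) || N(0, I) ), per sample; always non-negative
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return z, kl

def curriculum_beta(step, total_steps, beta_max=1.0):
    """Assumed linear curriculum on the bottleneck weight: large early
    (strong compression of vision/language embeddings, forcing reliance
    on force signals), decaying to zero so the full multimodal input is
    progressively restored."""
    return beta_max * max(0.0, 1.0 - step / total_steps)

# During fine-tuning, the bottleneck loss term would be weighted as
# beta_t * kl.mean(), added to the usual action-prediction loss.
```

A concrete schedule check: `curriculum_beta(0, 100)` gives the full weight `beta_max`, while any step at or past `total_steps` gives `0.0`, i.e. the bottleneck is fully removed by the end of the curriculum.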