MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

Humans naturally draw on diverse sensing modalities to interact with the physical world, whereas most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and a target description, which together select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden-object retrieval. MuseVLA achieves an average success rate of 80.6%, significantly outperforming RGB-only and multisensory VLA baselines, and exhibits strong zero-shot capabilities on unseen tasks.
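The tool-call analogy can be made concrete with a small sketch. The code below is purely illustrative and is not the MuseVLA implementation: the names `SensorCall`, `CONVERTERS`, and `render_sensor_image`, the token strings `<thermal>` and `<audio>`, the min-max normalization, and the optional `target_mask` (standing in for whatever grounding step localizes the described target) are all assumptions introduced here. It shows only the general idea of dispatching on a sensor token and mapping heterogeneous readings into one common image-like format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Grid = List[List[float]]  # image-like 2D grid of values


@dataclass
class SensorCall:
    """Hypothetical stand-in for the sensor token plus target description."""
    sensor_token: str  # which modality to invoke, e.g. "<thermal>"
    target: str        # what to attend to, e.g. "the hot cup"


def normalize(grid: Grid) -> Grid:
    """Min-max scale a 2D grid to [0, 1]; a constant grid maps to all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def thermal_to_image(reading: Grid) -> Grid:
    """Assumed thermal input: a 2D grid of temperatures."""
    return normalize(reading)


def audio_to_image(reading: List[float], height: int = 2) -> Grid:
    """Assumed audio input: loudness per direction bin, tiled into rows."""
    return normalize([list(reading) for _ in range(height)])


# Sensor-specific processing lives here, outside any policy backbone.
CONVERTERS: Dict[str, Callable[..., Grid]] = {
    "<thermal>": thermal_to_image,
    "<audio>": audio_to_image,
}


def render_sensor_image(
    call: SensorCall,
    readings: Dict[str, object],
    target_mask: Optional[List[List[bool]]] = None,
) -> Grid:
    """Dispatch on the sensor token and return a common image-like grid.

    target_mask is a hypothetical output of a grounding step for call.target;
    values outside the mask are zeroed.
    """
    if call.sensor_token not in CONVERTERS:
        raise KeyError(f"no converter registered for {call.sensor_token}")
    image = CONVERTERS[call.sensor_token](readings[call.sensor_token])
    if target_mask is None:
        return image
    return [
        [v if keep else 0.0 for v, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, target_mask)
    ]


readings = {
    "<thermal>": [[20.0, 30.0], [40.0, 60.0]],
    "<audio>": [0.0, 2.0, 4.0],
}
call = SensorCall(sensor_token="<thermal>", target="the hot cup")
print(render_sensor_image(call, readings))  # [[0.0, 0.25], [0.5, 1.0]]
```

Under these assumptions, supporting another modality means registering one more converter in `CONVERTERS`; the consumer of the returned grid does not change, which mirrors the decoupling the abstract describes.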
