Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, their manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded, physically meaningful masks, derived from sensors including an infrared camera, a mmWave radar, and a microphone array, onto RGB images. This image-native unification keeps sensor inputs close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this representation, we present a multisensory vision-language-action architecture and train the model from an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks in which sensor-modality perception guides robotic manipulation. OmniVLA achieves an average task success rate of 84%, outperforming RGB-only and raw-sensor-input baselines by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization.
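To make the sensor-masked image idea concrete, the sketch below shows one plausible way to overlay a spatially aligned sensor intensity map onto an RGB frame. It is a minimal illustration, not the paper's released implementation: the function name `overlay_sensor_mask`, the alpha-blending scheme, and all parameters are assumptions; the only premise taken from the abstract is a per-sensor 2D map registered to the camera frame and rendered so the result stays close to RGB statistics.

```python
# Illustrative sketch only; names and blending scheme are hypothetical,
# not the authors' implementation.
import numpy as np


def overlay_sensor_mask(rgb: np.ndarray,
                        sensor_map: np.ndarray,
                        color: tuple = (255, 0, 0),
                        alpha: float = 0.5) -> np.ndarray:
    """Blend a spatially aligned sensor intensity map onto an RGB image.

    rgb        : (H, W, 3) uint8 camera frame.
    sensor_map : (H, W) float map from e.g. an infrared camera, mmWave
                 radar, or microphone-array beamforming, normalized to
                 [0, 1] and registered to the camera frame.
    """
    mask = np.clip(sensor_map, 0.0, 1.0)[..., None]            # (H, W, 1)
    tint = np.asarray(color, dtype=np.float32)                 # overlay color
    blended = (1.0 - alpha * mask) * rgb.astype(np.float32) + alpha * mask * tint
    return blended.round().astype(np.uint8)                    # still an RGB image


# Example: overlay a synthetic radar detection on a dummy frame.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
radar = np.zeros((224, 224), dtype=np.float32)
radar[80:120, 100:140] = 1.0                                    # detected reflector
masked_image = overlay_sensor_mask(frame, radar)
```

Because the output remains an ordinary RGB-shaped image, it can be fed to an RGB-pretrained VLA backbone through a lightweight per-sensor projector, which is the data-efficiency argument the abstract makes.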