Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompt-aware encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1% to 28.7% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments.