Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide only an end-effector-centric viewpoint. As a result, these models cannot adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance on long-horizon tasks and in fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that equips robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm comprising two stages: (1) Critical region localization. ActiveVLA renders 3D inputs into multi-view 2D projections, identifies critical 3D regions, and thereby supports dynamic spatial awareness. (2) Active perception optimization. Guided by the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints that maximize amodal relevance and diversity while minimizing occlusion. Additionally, ActiveVLA applies a 3D zoom-in operation to increase effective resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
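To make the active view selection idea concrete, the sketch below shows one simplified way such a strategy could be scored: each candidate viewpoint gets a score combining a relevance term, an occlusion penalty, and a diversity bonus relative to already-chosen views, and views are picked greedily. This is an illustrative assumption, not ActiveVLA's actual method; all function names, the scoring weights (`lam`, `mu`), and the tuple-based candidate representation are hypothetical.

```python
import math


def angle_between(u, v):
    """Angle (radians) between two 3D view directions."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))


def view_score(direction, selected_dirs, relevance, occlusion, lam=0.5, mu=0.3):
    """Hypothetical score: relevance minus an occlusion penalty,
    plus a bonus for angular diversity w.r.t. views already selected."""
    diversity = min(
        (angle_between(direction, s) for s in selected_dirs),
        default=math.pi,  # first pick: maximal diversity by convention
    )
    return relevance - lam * occlusion + mu * diversity


def select_views(candidates, k, lam=0.5, mu=0.3):
    """Greedily pick k views.

    candidates: list of (direction, relevance, occlusion) tuples, where
    direction is a 3D vector, relevance estimates how well the view covers
    the localized critical region, and occlusion is in [0, 1].
    """
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = max(
            pool,
            key=lambda c: view_score(
                c[0], [s[0] for s in selected], c[1], c[2], lam, mu
            ),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```

Note how the diversity bonus steers the second pick away from a near-duplicate of the first view even when that duplicate has decent relevance; heavily occluded candidates are penalized regardless of relevance.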