Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types on hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200K image-language-camera-movement pairs for learning semantic camera movement, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(\pi_0\), achieving up to 31.25\% higher success rates on real-world tasks. These results demonstrate that tightly coupled perception and execution, trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
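The sketch below illustrates, at a schematic level, the two ideas highlighted above: decoupled camera and manipulation action heads over a shared encoder, and the bottom-up schedule that first trains camera control alone and then jointly optimizes both action types on hybrid data. It is not the released implementation; all module names, feature dimensions, action dimensions, and the dummy batches are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the vision-language backbone shared by both action heads."""
    def __init__(self, in_dim=512, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())
    def forward(self, obs):
        return self.net(obs)

# Decoupled action spaces: camera movement (6-DoF) vs. manipulation (7-DoF).
encoder = SharedEncoder()
camera_head = nn.Linear(256, 6)
manip_head = nn.Linear(256, 7)
mse = nn.MSELoss()

def dummy_batch(batch=8):
    """Placeholder for real image-language-action batches."""
    return torch.randn(batch, 512), torch.randn(batch, 6), torch.randn(batch, 7)

# Stage 1 (bottom-up): train semantic camera control only.
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(camera_head.parameters()), lr=1e-4)
for _ in range(100):
    obs, cam_gt, _ = dummy_batch()
    loss = mse(camera_head(encoder(obs)), cam_gt)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: jointly optimize camera and manipulation actions on hybrid data.
opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(camera_head.parameters())
    + list(manip_head.parameters()),
    lr=1e-4)
for _ in range(100):
    obs, cam_gt, manip_gt = dummy_batch()
    feat = encoder(obs)
    loss = mse(camera_head(feat), cam_gt) + mse(manip_head(feat), manip_gt)
    opt.zero_grad(); loss.backward(); opt.step()
```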