Active perception in vision-based robotic manipulation moves the camera toward more informative viewpoints, providing high-quality perceptual inputs for downstream tasks. Most existing active perception methods rely on iterative optimization, which incurs high time and motion costs, and are tightly coupled with task-specific objectives, which limits their transferability. In this paper, we propose a general one-shot multimodal active perception framework for robotic manipulation that directly infers optimal viewpoints; it comprises a data collection pipeline and an optimal viewpoint prediction network. Specifically, the framework decouples viewpoint quality evaluation from the overall architecture, allowing it to accommodate heterogeneous task requirements. Optimal viewpoints are defined by systematically sampling and evaluating candidate viewpoints, after which large-scale training datasets are constructed via domain randomization. A multimodal optimal viewpoint prediction network is then developed that uses cross-attention to align and fuse multimodal features and directly predicts camera pose adjustments. We instantiate the framework in robotic grasping under viewpoint constraints. Experimental results show that active perception guided by the framework significantly improves grasp success rates: in real-world evaluations it nearly doubles the grasp success rate and transfers from simulation to the real world without additional fine-tuning, demonstrating the effectiveness of the proposed framework.
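To make the data-collection step concrete, the following is a minimal sketch, assuming a hemispherical candidate set and a pluggable task-specific scorer. All function names, the `score_fn` interface, and the sampling parameters are illustrative assumptions rather than the paper's implementation: candidate camera poses are sampled systematically around the scene, each is scored by the decoupled quality evaluator, and the best pose becomes the optimal-viewpoint label for that scene.

```python
# Minimal sketch (not the authors' code) of labeling optimal viewpoints:
# sample candidates systematically, score them with a pluggable task-specific
# evaluator, and keep the argmax as the label for this scene.
import numpy as np

def sample_hemisphere_viewpoints(center, radius, n_azimuth=12, n_elevation=4):
    """Systematically sample camera positions on a hemisphere over the scene."""
    candidates = []
    for el in np.linspace(np.deg2rad(15), np.deg2rad(75), n_elevation):
        for az in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False):
            offset = radius * np.array([
                np.cos(el) * np.cos(az),
                np.cos(el) * np.sin(az),
                np.sin(el),
            ])
            candidates.append(center + offset)
    return np.stack(candidates)

def look_at_pose(position, target):
    """Build a 4x4 camera pose whose -z axis points at the target."""
    forward = target - position
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, up, -forward
    pose[:3, 3] = position
    return pose

def label_optimal_viewpoint(scene, score_fn, center, radius=0.6):
    """Score every candidate with the task-specific evaluator; keep the best."""
    positions = sample_hemisphere_viewpoints(center, radius)
    poses = [look_at_pose(p, center) for p in positions]
    scores = [score_fn(scene, pose) for pose in poses]  # e.g. grasp quality
    return poses[int(np.argmax(scores))]

# Dataset construction: repeat over many domain-randomized scenes, pairing each
# initial observation with the pose adjustment toward its labeled optimum.
```

Because the scorer is an argument rather than baked into the pipeline, swapping in a different task-specific evaluator changes the labels without touching the sampling or dataset-construction code, which is the decoupling the abstract describes.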
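The fusion step of the prediction network can be sketched in the same spirit. Below is a minimal PyTorch sketch of cross-attention fusion between two modality token streams followed by direct regression of a camera pose adjustment; the module names, token shapes, and the 6-DoF output parameterization are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch (not the paper's network): RGB tokens attend to geometry
# tokens via cross-attention, and a small head regresses a 6-DoF pose delta
# (translation plus axis-angle rotation) in a single forward pass.
import torch
import torch.nn as nn

class CrossModalViewpointHead(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        # Queries come from one modality, keys/values from the other.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pose_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 6),  # (dx, dy, dz, axis-angle rotation)
        )

    def forward(self, rgb_tokens, geom_tokens):
        # rgb_tokens: (B, N, dim); geom_tokens: (B, M, dim), assumed already
        # projected by modality-specific encoders (not shown here).
        fused, _ = self.cross_attn(rgb_tokens, geom_tokens, geom_tokens)
        fused = self.norm(fused + rgb_tokens)  # residual connection
        pooled = fused.mean(dim=1)             # global average pool over tokens
        return self.pose_head(pooled)          # predicted camera pose delta

# Example: batch of 2 scenes, 196 RGB tokens and 512 geometry tokens.
model = CrossModalViewpointHead()
delta = model(torch.randn(2, 196, 256), torch.randn(2, 512, 256))
print(delta.shape)  # torch.Size([2, 6])
```

A single forward pass maps the current multimodal observation to a pose adjustment, which is what makes the approach one-shot: no iterative viewpoint optimization is needed at inference time.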