Robotic manipulation remains challenging, and imitation learning (IL) enables robots to learn tasks from expert demonstrations. Current IL methods typically rely on fixed camera setups, with cameras manually positioned at static locations, which significantly limits adaptability and coverage. Inspired by human active perception, in which humans dynamically adjust their viewpoint to capture the most relevant and least noisy information, we propose MAE-Select, a novel framework for active viewpoint selection in single-camera robotic systems. MAE-Select fully leverages pre-trained multi-view masked autoencoder representations and dynamically selects the most informative next viewpoint at each time chunk without requiring labeled viewpoints. Extensive experiments demonstrate that MAE-Select improves the capabilities of single-camera systems and, in some cases, even surpasses multi-camera setups. The project will be available at https://mae-select.github.io.