Perception models have progressed significantly, especially those trained on large-scale internet images. However, efficiently generalizing these models to unseen embodied tasks remains insufficiently studied, even though doing so would benefit many applications (e.g., home robots). Unlike static perception methods trained on pre-collected images, an embodied agent can move around in the environment and obtain images of objects from any viewpoint. The key to this task is therefore to efficiently learn an exploration policy and a collection method that gather informative training samples. To this end, we first build a 3D semantic distribution map and use it to train the exploration policy in a self-supervised manner by introducing two rewards: semantic distribution disagreement and semantic distribution uncertainty. Because the map is generated from multi-view observations, it can weaken the impact of misidentification from an unfamiliar viewpoint. The agent is thus encouraged to explore objects whose semantic distributions differ across viewpoints or are uncertain. Given the explored informative trajectories, we propose selecting hard samples along them based on the semantic distribution uncertainty, which reduces unnecessary observations that can already be correctly identified. Experiments show that the perception model fine-tuned with our method outperforms baselines trained with other exploration policies. Further, we demonstrate the robustness of our method in real-robot experiments.
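The abstract does not give formulas for the two rewards or for hard-sample selection, so the following is only a minimal sketch under stated assumptions. It assumes uncertainty can be scored as the normalized entropy of a fused (map-level) class distribution, disagreement as the KL divergence between a single-view prediction and the fused distribution, and hard-sample selection as a simple threshold on the uncertainty score. The function names, the threshold value, and the toy distributions are all hypothetical and are not taken from the paper.

```python
import numpy as np

EPS = 1e-8  # avoids log(0); an implementation detail of this sketch


def normalize(p):
    """Clip and renormalize along the last axis so each row sums to 1."""
    p = np.clip(np.asarray(p, dtype=np.float64), EPS, None)
    return p / p.sum(axis=-1, keepdims=True)


def uncertainty(fused):
    """Assumed uncertainty score: Shannon entropy scaled to [0, 1]."""
    fused = normalize(fused)
    entropy = -(fused * np.log(fused)).sum(axis=-1)
    return entropy / np.log(fused.shape[-1])


def disagreement(view, fused):
    """Assumed disagreement score: KL(view || fused) per row."""
    view, fused = normalize(view), normalize(fused)
    return (view * (np.log(view) - np.log(fused))).sum(axis=-1)


def select_hard(scores, threshold=0.5):
    """Return indices of observations whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy data: one row per observed object, three semantic classes.
fused = np.array([[0.98, 0.01, 0.01],   # map is confident
                  [0.34, 0.33, 0.33]])  # map is nearly uniform
view = np.array([[0.98, 0.01, 0.01],    # current view agrees with the map
                 [0.05, 0.90, 0.05]])   # current view contradicts the map

u = uncertainty(fused)
d = disagreement(view, fused)
hard = select_hard(u, threshold=0.5)
```

In this toy case the second object is both uncertain in the fused map and inconsistent with the current view, so under these assumed scores it would attract exploration reward and be kept as a hard sample, while the first object would be skipped.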