Using multiple camera views simultaneously has been shown to improve the generalization and performance of visual policies. However, hardware cost and design constraints can make multi-camera setups impractical in real-world scenarios. In this study, we present a novel approach to improve the generalization of vision-based Reinforcement Learning (RL) algorithms for robotic manipulation tasks. Our method uses knowledge distillation, in which a pre-trained ``teacher'' policy trained with multiple camera viewpoints guides a ``student'' policy that learns from a single camera viewpoint. To make the student policy robust to camera location perturbations, it is trained with data augmentation and extreme viewpoint changes. As a result, the student policy learns robust visual features that allow it to locate the object of interest accurately and consistently, regardless of the camera viewpoint. We evaluated the efficacy and efficiency of the proposed method in both simulation and real-world environments. The results demonstrate that the single-view student policy can learn to grasp and lift a challenging object, which a single-view policy alone could not achieve. Furthermore, the student policy demonstrates zero-shot transfer: it successfully grasps and lifts objects in real-world scenarios under unseen visual configurations.
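The teacher-to-student setup described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the scalar "object position" scene, the mean-squared-error imitation loss, the linear student, and all names (`teacher_policy`, `LinearStudent`, `make_views`, `train`, `evaluate`). It shows only the general idea that a frozen teacher sees every camera view while the student is trained to match the teacher's action from one randomly chosen view.

```python
import random


def distillation_loss(student_actions, teacher_actions):
    """Mean squared error between student and teacher actions (assumed loss)."""
    assert len(student_actions) == len(teacher_actions)
    n = len(student_actions)
    return sum((s - t) ** 2 for s, t in zip(student_actions, teacher_actions)) / n


def teacher_policy(views):
    """Hypothetical frozen teacher: averages the estimate from every camera."""
    return sum(v[0] for v in views) / len(views)


class LinearStudent:
    """Hypothetical single-view student: action = w . view + b."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def act(self, view):
        return sum(wi * xi for wi, xi in zip(self.w, view)) + self.b

    def update(self, view, target, lr):
        # One SGD step on the squared error against the teacher's action.
        err = self.act(view) - target
        for i, xi in enumerate(view):
            self.w[i] -= lr * 2 * err * xi
        self.b -= lr * 2 * err


def make_views(rng, n_views=3):
    """Toy scene: one object position seen by n cameras with per-camera noise."""
    x = rng.uniform(-1.0, 1.0)
    return [[x + rng.gauss(0.0, 0.05)] for _ in range(n_views)]


def train(student, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        views = make_views(rng)
        target = teacher_policy(views)   # teacher uses all camera views
        view = rng.choice(views)         # student sees one random viewpoint
        student.update(view, target, lr)


def evaluate(student, n=200, seed=1):
    """Distillation loss on fresh scenes, with the student using camera 0 only."""
    rng = random.Random(seed)
    student_actions, teacher_actions = [], []
    for _ in range(n):
        views = make_views(rng)
        teacher_actions.append(teacher_policy(views))
        student_actions.append(student.act(views[0]))
    return distillation_loss(student_actions, teacher_actions)


if __name__ == "__main__":
    student = LinearStudent(n_features=1)
    print("loss before:", evaluate(student))
    train(student)
    print("loss after: ", evaluate(student))
```

Sampling a different camera at each step is a crude stand-in for the viewpoint randomization mentioned in the abstract. A real system would use image observations, a neural network student, and image-level augmentation.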