Recent studies on visual reinforcement learning (visual RL) have explored the use of 3D visual representations. However, none of these works has systematically compared the efficacy of 3D representations with that of 2D representations across different tasks, nor has any of them analyzed 3D representations from the perspective of agent-object / object-object relationship reasoning. In this work, we seek to answer the question of when and how 3D neural networks that learn features in the 3D-native space provide a beneficial inductive bias for visual RL. We focus specifically on 3D point clouds, one of the most common forms of 3D representation. We systematically investigate design choices for 3D point cloud RL, leading to the development of a robust algorithm for various robotic manipulation and control tasks. Furthermore, by comparing 2D image and 3D point cloud RL methods on both minimalist synthetic tasks and complex robotic manipulation tasks, we find that 3D point cloud RL can significantly outperform its 2D counterpart when agent-object / object-object relationship encoding is a key factor.