Perceiving the environment via cameras is crucial for Reinforcement Learning (RL) in robotics. While images are a convenient form of representation, they often complicate the extraction of important geometric details, especially with varying geometries or deformable objects. In contrast, point clouds naturally represent this geometry and easily integrate color and positional data from multiple camera views. However, while deep learning on point clouds has seen many recent successes, RL on point clouds is under-researched, with only the simplest encoder architectures considered in the literature. We introduce PointPatchRL (PPRL), a method for RL on point clouds that builds on the common paradigm of dividing point clouds into overlapping patches, tokenizing them, and processing the tokens with transformers. PPRL provides significant improvements over other point-cloud processing architectures previously used for RL. We then complement PPRL with masked reconstruction for representation learning and show that our method outperforms strong model-free and model-based baselines on image observations in complex manipulation tasks containing deformable objects and variations in target object geometry. Videos and code are available at https://alrhub.github.io/pprl-website.
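The patch paradigm the abstract refers to can be illustrated with a minimal sketch: sample patch centers with farthest point sampling, group each center's k nearest neighbors into an (overlapping) patch, and express coordinates relative to the center so each patch is ready for tokenization. This is an assumption-laden illustration of the general paradigm, not the paper's actual implementation; all function names and parameters here are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, n_centers, seed=0):
    """Greedily pick n_centers indices of points that are maximally spread out."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(points.shape[0]))]
    dists = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(n_centers - 1):
        idx = int(np.argmax(dists))  # farthest point from all chosen centers
        centers.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return np.array(centers)

def patchify(points, n_patches=8, patch_size=32):
    """Group a point cloud into overlapping patches around FPS centers.

    Returns center-relative patch coordinates (n_patches, patch_size, 3)
    and the patch centers (n_patches, 3); these patches would then be fed
    to a learned tokenizer and a transformer encoder.
    """
    center_idx = farthest_point_sampling(points, n_patches)
    centers = points[center_idx]
    # k nearest neighbours of each center form one (possibly overlapping) patch
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :patch_size]
    patches = points[knn] - centers[:, None, :]  # center-relative coordinates
    return patches, centers

# Toy point cloud: 256 random points in the unit cube.
cloud = np.random.default_rng(1).random((256, 3))
patches, centers = patchify(cloud)
print(patches.shape, centers.shape)  # (8, 32, 3) (8, 3)
```

Because patches are defined by k-nearest-neighbor grouping rather than a disjoint partition, neighboring patches overlap, mirroring the overlapping-patch tokenization the abstract describes.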