Deploying visual reinforcement learning (RL) policies for real-world manipulation is often hindered by changes in camera viewpoint. A policy trained with a fixed front-facing camera may fail when the camera is shifted, which is hard to avoid in real-world settings where sensor placement is difficult to control. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture for robotic manipulation that learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and its strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments.
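To make the idea of aligning observations into a unified coordinate system concrete, the following is a minimal sketch, not the ViewNet implementation. The abstract does not say how ViewNet parameterizes the alignment, so the sketch assumes it can be modeled as a rigid transform (rotation plus translation) per viewpoint. The helper names (`rotation_z`, `apply_rigid`, `invert_rigid`), the camera poses, and the point values are all hypothetical. The sketch recovers the alignment from known poses purely for illustration, whereas the paper's module is described as producing the alignment without extrinsic calibration.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis (angle in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(points, R, t):
    """Map every point p to R @ p + t."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def invert_rigid(R, t):
    """Inverse of p -> R p + t, namely q -> R^T q - R^T t."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = tuple(-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3))
    return Rt, t_inv

def max_abs_diff(xs, ys):
    """Largest coordinate-wise difference between two point lists."""
    return max(abs(a - b) for p, q in zip(xs, ys) for a, b in zip(p, q))

# Hypothetical scene points expressed in the shared (canonical) frame.
canonical = [(0.1, 0.2, 0.3), (0.4, -0.1, 0.0), (-0.2, 0.5, 0.25)]

# Two hypothetical camera poses, each mapping canonical -> camera frame.
cam_a = (rotation_z(math.radians(30)), (0.5, 0.0, 1.0))
cam_b = (rotation_z(math.radians(-75)), (-0.3, 0.8, 1.2))

# The same scene looks different from each viewpoint...
obs_a = apply_rigid(canonical, *cam_a)
obs_b = apply_rigid(canonical, *cam_b)

# ...but coincides again once each observation is mapped back to the
# shared frame. Here the alignment comes from the known poses (an
# assumption of this sketch); a learned module would have to estimate it.
aligned_a = apply_rigid(obs_a, *invert_rigid(*cam_a))
aligned_b = apply_rigid(obs_b, *invert_rigid(*cam_b))

print("raw view mismatch:    ", max_abs_diff(obs_a, obs_b))
print("aligned view mismatch:", max_abs_diff(aligned_a, aligned_b))
```

After alignment, a downstream policy would see near-identical inputs regardless of where the camera was placed, which is the property a view-invariant representation aims for.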