Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors that describe scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-based methods struggle to perceive 3D structural information due to the inefficient dense sampling in volumetric rendering. Moreover, their scene representation vectors lack fine-grained semantic information because they treat free and occupied space equally. Both issues can degrade the performance of downstream RL tasks. To address these challenges, we propose a novel framework that, for the first time, adopts the efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representations. Specifically, we present a Query-based Generalizable 3DGS that bridges the 3DGS technique and scene representations with greater geometric awareness than NeRF-based approaches. Moreover, we present a Hierarchical Semantics Encoding that grounds fine-grained semantic features to the 3D Gaussians and further distills them into the scene representation vectors. We conduct extensive experiments on two RL platforms, ManiSkill2 and Robomimic, across 10 different tasks. The results show that our method outperforms the other five baselines by a large margin: we achieve the best success rates on eight tasks and the second-best on the remaining two.