Latent scene representations play a significant role in training reinforcement learning (RL) agents. To obtain latent vectors that describe scenes well, recent works incorporate a 3D-aware, latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-based methods struggle to perceive 3D structural information because of the inefficient dense sampling required by volumetric rendering. Moreover, their scene representation vectors lack fine-grained semantic information because they treat free and occupied space equally. Both shortcomings can degrade the performance of downstream RL tasks. To address these challenges, we propose a novel framework that, for the first time, adopts efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representations. Specifically, we present Query-based Generalizable 3DGS, which bridges the 3DGS technique and scene representation learning and yields representations with greater geometric awareness than NeRF-based ones. We further present Hierarchical Semantics Encoding, which grounds fine-grained semantic features to the 3D Gaussians and then distills them into the scene representation vectors. We conduct extensive experiments on two RL platforms, Maniskill2 and Robomimic, across 10 different tasks. The results show that our method outperforms the 5 baselines by a large margin, achieving the best success rate on 8 tasks and the second-best on the other 2.