In this paper, we provide a comprehensive overview of existing scene representation methods for robotics, covering traditional representations such as point clouds, voxels, signed distance functions (SDFs), and scene graphs, as well as more recent neural representations such as Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and emerging foundation models. While current SLAM and localization systems predominantly rely on sparse representations such as point clouds and voxels, dense scene representations are expected to play a critical role in downstream tasks such as navigation and obstacle avoidance. Moreover, neural representations such as NeRF, 3DGS, and foundation models are well suited to integrating high-level semantic features and language-based priors, enabling more comprehensive 3D scene understanding and embodied intelligence. We categorize the core modules of robotics into five parts: perception, mapping, localization, navigation, and manipulation. We begin by presenting the standard formulation of each scene representation method and comparing the advantages and disadvantages of different representations across these modules. This survey is centered on the question: what is the best 3D scene representation for robotics? We then discuss future development trends of 3D scene representations, with a particular focus on how 3D foundation models could replace current methods as a unified solution for future robotic applications, and we examine the remaining challenges in fully realizing this vision. We aim to offer a valuable resource for both newcomers and experienced researchers exploring the future of 3D scene representations and their applications in robotics. We have published an open-source project on GitHub and will continue to add new works and technologies to it.