When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
