The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose 3D-POPE, a comprehensive benchmark for systematically evaluating hallucination in 3D-LLMs, enabling fair comparisons across models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results also demonstrate early signals of effective sim-to-real transfer, indicating that models trained on large synthetic datasets can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights that lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io