The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose 3D-POPE, a comprehensive benchmark for systematically evaluating hallucination in 3D-LLMs, enabling fair comparisons across models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results also demonstrate early signals of effective sim-to-real transfer, indicating that models trained on large synthetic datasets can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights that lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io