The recent development of Large Language Models (LLMs) with strong reasoning abilities has driven research across domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging due to the limited reasoning ability of current 3D visual grounding models. Most existing methods combine a text encoder and a visual feature encoder to produce cross-modal fused features and predict the referred object; these models typically require supervised training on extensive 3D annotation data. On the other hand, recent research has also focused on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and disproportionate to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning processes. We further leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.