Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world remains notably deficient, limiting progress in 3D language understanding and generation. To address this problem, we introduce GPT4Point, a point-language multimodal model designed for unified 3D object understanding and generation within the MLLM framework. As a 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks, such as point-cloud captioning and Q&A. GPT4Point also supports controllable 3D generation: it can produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors. To meet the need for large numbers of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database of over 1M objects with varied levels of text granularity from the Objaverse-XL dataset, which is essential for training GPT4Point. We also propose a comprehensive benchmark to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point demonstrates superior performance in both understanding and generation.