The unprecedented advancements in Large Language Models (LLMs) have had a profound impact on natural language processing but have yet to be fully extended to 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, enabling LLMs to understand point clouds and offering a new avenue beyond 2D visual data. PointLLM processes colored object point clouds together with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it combines a point cloud encoder with a powerful LLM to fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: first aligning the latent spaces, then instruction-tuning the unified model. To rigorously evaluate the perceptual and generalization capabilities of PointLLM, we establish two benchmarks, Generative 3D Object Classification and 3D Object Captioning, each assessed with three methods: human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experimental results show that PointLLM outperforms existing 2D and 3D baselines; notably, in human-evaluated object captioning it surpasses human annotators in over 50% of the samples. Code, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM.
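To make the data flow in the abstract concrete, the following is a minimal sketch of how features from a point cloud encoder might be placed in the same sequence as text embeddings before being passed to an LLM. It is not the PointLLM implementation: `toy_encoder`, `linear_project`, `build_llm_input`, the mean-pooling encoder, the linear projection, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import random


def make_colored_point_cloud(n_points, seed=0):
    """Synthetic colored point cloud: each row is (x, y, z, r, g, b)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(3)] + [rng.random() for _ in range(3)]
        for _ in range(n_points)
    ]


def toy_encoder(points, n_tokens):
    """Hypothetical stand-in for a point cloud encoder.

    Splits the points into n_tokens consecutive groups and mean-pools each
    group into one 6-dim feature (geometry + color). A real encoder would be
    a learned network; this only mimics the output shape.
    """
    group = len(points) // n_tokens
    tokens = []
    for i in range(n_tokens):
        chunk = points[i * group:(i + 1) * group]
        tokens.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return tokens


def linear_project(tokens, weight):
    """Map each feature into the (assumed) LLM embedding width.

    weight has in_dim rows and out_dim columns.
    """
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]


def build_llm_input(point_tokens, text_embeddings):
    """Assumed fusion scheme: point tokens are prepended to text embeddings."""
    return point_tokens + text_embeddings


# Illustrative sizes only; none of these come from the paper.
IN_DIM, EMBED_DIM, N_POINT_TOKENS, N_TEXT_TOKENS = 6, 16, 8, 5
rng = random.Random(1)
weight = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in range(IN_DIM)]
text_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(N_TEXT_TOKENS)
]

points = make_colored_point_cloud(64)
point_tokens = linear_project(toy_encoder(points, N_POINT_TOKENS), weight)
sequence = build_llm_input(point_tokens, text_embeddings)
```

In this sketch, `sequence` holds the projected point tokens followed by the text embeddings, all with the same width, which is the property a shared latent space needs. Under the two-stage strategy described in the abstract, the alignment stage would presumably focus on learning a mapping like `weight`, but which parameters are updated in each stage is not stated here.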