Recent advances in Large Language Models (LLMs) have had a profound impact on natural language processing, but LLMs have yet to be extended fully to 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap that enables LLMs to understand point clouds and offers a new avenue beyond 2D visual data. PointLLM takes colored object point clouds together with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it combines a point cloud encoder with a powerful LLM to fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: first aligning the latent spaces, then instruction-tuning the unified model. To rigorously evaluate our model's perceptual and generalization capabilities, we establish two benchmarks, Generative 3D Object Classification and 3D Object Captioning, assessed with three methods: human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experimental results show that PointLLM outperforms existing 2D baselines. Remarkably, in human-evaluated object captioning, PointLLM outperforms human annotators on more than 50% of the samples. Code, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM.
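To make the data flow described above concrete, the following is a minimal sketch of how a colored point cloud could be turned into tokens and placed in front of an instruction before reaching an LLM. It is not the PointLLM implementation: the grouping-based `toy_point_encoder`, the random `projector_weight`, the token counts, and the embedding width are all hypothetical stand-ins chosen for illustration. The abstract does not specify which parameters are updated in each training stage, so the sketch models only the forward data path.

```python
import random

def toy_point_encoder(points, num_tokens):
    """Hypothetical stand-in encoder: split the cloud into equal groups and
    average each group's (x, y, z, r, g, b) values into one feature vector."""
    group = len(points) // num_tokens
    feats = []
    for i in range(num_tokens):
        chunk = points[i * group:(i + 1) * group]
        feats.append([sum(p[k] for p in chunk) / len(chunk) for k in range(6)])
    return feats

def linear(features, weight):
    """Apply a linear map to every row; `weight` has shape d_in x d_out."""
    return [[sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
            for row in features]

def build_llm_input(point_tokens, text_embeddings):
    """Place the projected point tokens before the instruction embeddings."""
    return point_tokens + text_embeddings

rng = random.Random(0)
EMBED_DIM = 8  # assumed LLM embedding width, tiny for illustration

# A colored point cloud: 64 points, each with xyz coordinates and rgb color.
points = [[rng.random() for _ in range(6)] for _ in range(64)]

# Encoder output: 4 feature vectors of width 6.
point_feats = toy_point_encoder(points, num_tokens=4)

# Assumed projector: maps encoder features into the LLM embedding space.
projector_weight = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(6)]
point_tokens = linear(point_feats, projector_weight)

# Placeholder embeddings for a 3-token instruction.
text_embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(3)]

sequence = build_llm_input(point_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))
```

In this toy setup, aligning latent spaces would correspond to fitting `projector_weight` so that point tokens land in the same space as the text embeddings; that reading is an assumption rather than a detail given in the abstract.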