PointLLM: Empowering Large Language Models to Understand Point Clouds

From arXiv, 19 pages. Empowering large language models with 3D point cloud understanding, accompanied by a novel dataset and carefully designed benchmarks. Project page: https://runsenxu.com/projects/PointLLM

The unprecedented advancements in Large Language Models (LLMs) have created a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, thereby enabling LLMs to understand point clouds and offering a new avenue beyond 2D visual data. PointLLM processes colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: initially aligning latent spaces and subsequently instruction-tuning the unified model. To rigorously evaluate our model's perceptual abilities and its generalization capabilities, we establish two benchmarks: Generative 3D Object Classification and 3D Object Captioning, assessed through three different methods, including human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experiment results show that PointLLM demonstrates superior performance over existing 2D baselines. Remarkably, in human-evaluated object captioning tasks, PointLLM outperforms human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM .
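The two-stage strategy described in the abstract can be sketched as a small schedule. This is an illustrative sketch only, not the authors' training code. The pair counts come from the abstract. The module names (`point_encoder`, `projector`, `llm`), the mapping of the simple pairs to stage one and the complex pairs to stage two, and the choice of which modules are trainable in each stage are assumptions, modeled on common practice for multimodal LLMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    num_pairs: int        # point-text instruction pairs (counts from the abstract)
    trainable: frozenset  # ASSUMPTION: which modules receive gradient updates


# Hypothetical module names; the abstract only mentions an encoder and an LLM.
MODULES = ("point_encoder", "projector", "llm")

STAGES = (
    # Stage 1: align the latent spaces using the simple pairs (assumed mapping).
    Stage("align_latent_spaces", 660_000, frozenset({"projector"})),
    # Stage 2: instruction-tune the unified model on the complex pairs (assumed mapping).
    Stage("instruction_tuning", 70_000, frozenset({"projector", "llm"})),
)


def frozen_modules(stage):
    """Return the modules that stay fixed during the given stage."""
    return [m for m in MODULES if m not in stage.trainable]


for stage in STAGES:
    print(stage.name, stage.num_pairs, "frozen:", frozen_modules(stage))
```

In a real framework, the frozen list would translate into disabling gradient updates for those modules before each stage begins.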


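Point cloud

A point cloud obtained by laser measurement contains 3D coordinates (XYZ) and laser reflection intensity (Intensity). A point cloud obtained by photogrammetry contains 3D coordinates (XYZ) and color information (RGB). A point cloud obtained by combining laser measurement and photogrammetry contains 3D coordinates (XYZ), laser reflection intensity (Intensity), and color information (RGB). Once the spatial coordinates of each sampled point on an object's surface have been acquired, the resulting set of points is called a "point cloud".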
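The three kinds of point cloud described above differ only in which per-point channels they carry. A minimal sketch in plain Python follows. The `Point` class and the `channels` helper are hypothetical names introduced for illustration, not part of PointLLM or any point cloud library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Point:
    xyz: Tuple[float, float, float]             # 3D coordinates, always present
    intensity: Optional[float] = None           # laser reflection intensity (laser measurement)
    rgb: Optional[Tuple[int, int, int]] = None  # color information (photogrammetry)


def channels(cloud: List[Point]) -> List[str]:
    """Return the channel names carried by every point in the cloud."""
    names = ["xyz"]
    if cloud and all(p.intensity is not None for p in cloud):
        names.append("intensity")
    if cloud and all(p.rgb is not None for p in cloud):
        names.append("rgb")
    return names


# One tiny example cloud per acquisition method.
laser = [Point((0.0, 0.0, 1.0), intensity=0.8), Point((0.1, 0.0, 1.0), intensity=0.5)]
photo = [Point((0.0, 0.0, 1.0), rgb=(255, 0, 0))]
fused = [Point((0.0, 0.0, 1.0), intensity=0.8, rgb=(255, 0, 0))]

print(channels(laser), channels(photo), channels(fused))
```

The colored object point clouds mentioned in the abstract correspond to the XYZ plus RGB case.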
