Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically either downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning. Project page: \url{https://gfmei.github.io/PerLA/}
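To make the locality-preserving step concrete, below is a minimal sketch (not the authors' implementation) of how a Hilbert curve can partition a point cloud into spatially compact patches for parallel local processing. It assumes the third-party `hilbertcurve` package; the function name `hilbert_partition` and the parameters `order` and `num_patches` are illustrative, not from the paper.

```python
# Sketch: Hilbert-curve-based partitioning of a point cloud into local patches.
# Assumes `pip install hilbertcurve numpy`; all names here are illustrative.
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

def hilbert_partition(points: np.ndarray, num_patches: int, order: int = 10):
    """Sort points by their 3D Hilbert index and split the sorted sequence
    into contiguous chunks, so each chunk covers a spatially local region."""
    # Quantize coordinates onto the integer grid [0, 2^order - 1].
    mins, maxs = points.min(0), points.max(0)
    grid = ((points - mins) / np.maximum(maxs - mins, 1e-9)
            * (2 ** order - 1)).astype(int)
    # Map each voxelized point to its 1D position along the Hilbert curve;
    # nearby curve positions correspond to nearby 3D locations.
    curve = HilbertCurve(p=order, n=3)
    dists = np.asarray(curve.distances_from_points(grid.tolist()))
    # Contiguous chunks of the Hilbert-sorted order form the local patches.
    idx = np.argsort(dists)
    return np.array_split(idx, num_patches)

# Example: split 8192 random points into 16 locality-preserving patches.
pts = np.random.rand(8192, 3)
patches = hilbert_partition(pts, num_patches=16)
print(len(patches), patches[0].shape)  # 16 (512,)
```

Each patch can then be encoded at high resolution in parallel, with the resulting local features aggregated against global features from the downsampled scene, e.g. via cross-attention as the abstract describes.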