Recent advances in Large Multimodal Models (LMMs) have enabled a variety of applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the need to understand permutation-invariant point cloud representations of 3D scenes. Existing works rely on multi-view images and project 2D features into 3D space to form 3D scene representations. This, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further helps to remove ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
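To make the notion of a permutation-invariant point cloud representation concrete, the following is a minimal sketch in plain Python. It is not LL3DA's encoder: `pointwise_feature` and `encode_point_cloud` are hypothetical names introduced here for illustration, and the fixed per-point map stands in for a learned network. The sketch assumes only that each point is mapped to a feature independently and that the features are then aggregated with a symmetric operation (here, a channel-wise max), so reordering the points leaves the scene-level feature unchanged.

```python
import random


def pointwise_feature(point):
    # Hypothetical per-point feature map (a stand-in for a learned MLP).
    x, y, z = point
    return (x + y + z, x * y, z * z)


def encode_point_cloud(points):
    # Apply the same map to every point, then take a channel-wise max.
    # Max is a symmetric function, so the result ignores point order.
    features = [pointwise_feature(p) for p in points]
    return tuple(max(channel) for channel in zip(*features))


points = [(0.0, 1.0, 2.0), (1.5, -0.5, 0.25), (-1.0, 2.0, 0.5), (0.3, 0.3, 0.3)]

# Reorder the same set of points with a fixed seed.
shuffled = points[:]
random.Random(0).shuffle(shuffled)

# The scene-level feature is identical for both orderings.
same = encode_point_cloud(points) == encode_point_cloud(shuffled)
print(same)
```

Any symmetric aggregation (sum, mean, max) would give the same order-independence; max is used here only because it keeps the comparison exact in floating point.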