This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, ReCon++, which extends ReCon with multi-view image distillation for enhanced geometry understanding. With ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our new human-curated evaluation benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.