3D understanding is increasingly important in computer vision, autonomous driving, and robotics. However, the prevailing approach of directly transferring 2D alignment strategies to the 3D domain faces three challenges: (1) Information degradation: 3D data is aligned with only single-view 2D images and generic texts, ignoring the need for multi-view images and detailed subcategory texts. (2) Insufficient synergy: these strategies align 3D representations with image and text features separately, which hampers the overall optimization of 3D models. (3) Underutilization: the fine-grained information in the learned representations is often not fully exploited, so detail may be lost. To address these issues, we introduce JM3D, a comprehensive approach that integrates point clouds, text, and images. Our key contributions are the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, integrates the 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish the superiority of JM3D. The strong performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.
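As a rough illustration of the difference between aligning a 3D representation with image and text features separately and aligning it with a single joint target, the sketch below computes a one-directional InfoNCE-style contrastive loss between point cloud embeddings and a fused vision-language embedding. This is not the JMA formulation from the paper: the fusion rule in `joint_target` (mean of the view embeddings plus the text embedding), the temperature value, the helper names, and the toy 2D embeddings are all assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_target(image_views, text_emb):
    """Hypothetical fusion: average the multi-view image embeddings,
    add the text embedding, and renormalize. An assumed rule, not JMA."""
    dim = len(text_emb)
    mean_view = [sum(v[i] for v in image_views) / len(image_views) for i in range(dim)]
    fused = [m + t for m, t in zip(normalize(mean_view), normalize(text_emb))]
    return normalize(fused)


def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to targets[i] among all targets."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy data: two objects, each with two image views and one text embedding.
point_embs = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
targets = [
    joint_target([[1.0, 0.0], [0.9, 0.2]], [1.0, 0.1]),
    joint_target([[0.0, 1.0], [0.2, 0.9]], [0.1, 1.0]),
]

aligned_loss = info_nce(point_embs, targets)
swapped_loss = info_nce(point_embs, targets[::-1])  # deliberately mismatched pairs
print(aligned_loss, swapped_loss)
```

With correctly paired targets the loss is much lower than with the pairs swapped, which is the behavior a contrastive alignment objective is meant to reward. In practice the embeddings would come from trained point cloud, image, and text encoders rather than hand-written vectors.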