Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited by the potential loss of spatial geometry and high computational cost. More importantly, their frameworks are designed mainly for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method that empowers any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism assigns each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by true projection and better motivates the transformer to learn 3D representations with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, prompting the semantic adaptation of any-modality transformers. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
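The 3D-to-2D virtual projection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of an orthographic projection, the 14×14 patch grid, and the randomly initialized stand-in for a frozen model's positional-embedding table are all illustrative assumptions. It only shows the core idea of mapping each 3D point to 2D patch positions (without rendering any image) and gathering pre-trained positional priors there.

```python
# Hedged sketch of "virtual projection": each 3D point is projected onto
# virtual view planes, and the resulting 2D coordinates index a pretrained
# 2D positional-embedding table. Names/shapes are illustrative assumptions.
import numpy as np

def virtual_project(points, view_rotations, grid_size=14, embed_dim=768):
    """Assign each 3D point a positional encoding averaged over virtual views.

    points:         (N, 3) point coordinates normalized to [-1, 1]
    view_rotations: list of (3, 3) rotation matrices, one per virtual view
    """
    rng = np.random.default_rng(0)
    # Stand-in for a frozen transformer's 2D positional-embedding table:
    # one vector per patch position on a grid_size x grid_size grid.
    pos_table = rng.standard_normal((grid_size * grid_size, embed_dim))

    encodings = np.zeros((points.shape[0], embed_dim))
    for R in view_rotations:
        rotated = points @ R.T          # rotate the cloud into the view frame
        uv = rotated[:, :2]             # drop depth: virtual orthographic view
        # Map [-1, 1] coordinates to integer patch indices on the 2D grid.
        ij = np.clip(((uv + 1.0) / 2.0 * grid_size).astype(int),
                     0, grid_size - 1)
        idx = ij[:, 0] * grid_size + ij[:, 1]
        encodings += pos_table[idx]     # gather pretrained 2D positional priors
    return encodings / len(view_rotations)  # average across virtual views

points = np.random.default_rng(1).uniform(-1.0, 1.0, size=(1024, 3))
views = [np.eye(3)]                     # single identity view for the demo
pe = virtual_project(points, views)
print(pe.shape)                         # (1024, 768)
```

Because no image is ever rendered, no points are occluded or discarded, which is the sense in which virtual projection avoids the geometry loss of a true projection.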