Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

Large foundation models have recently become a prominent focus of interest, attaining superior performance across a wide range of scenarios. Because 3D data is scarce, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited by the potential loss of spatial geometry and by high computational cost. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method that empowers any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism lets us assign each 3D token a positional encoding paired with the pre-trained model, which avoids the loss of 3D geometry caused by a true projection and uses the 1D/2D positional priors to better motivate the transformer for 3D learning. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, promoting the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
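To make the 3D-to-2D case of the virtual projection concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes points normalized to [-1, 1], orthographic projection onto a few rotated virtual planes, bilinear lookup into a frozen 2D positional-embedding grid, and averaging over views. The function names (`rotation_y`, `virtual_projection_pe`), the number of views, and the interpolation and averaging choices are all illustrative assumptions; the abstract only states that 3D points are correlated with the source model's original 2D positions. The 3D points themselves are never rasterized into an image; only their positional encodings come from the 2D grid.

```python
import numpy as np

def rotation_y(angle: float) -> np.ndarray:
    """Rotation matrix about the y-axis (hypothetical choice of virtual views)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def virtual_projection_pe(points, pos_table, views):
    """Assign each 3D point a positional encoding drawn from a 2D table.

    points:    (N, 3) coordinates, assumed normalized to [-1, 1].
    pos_table: (G, G, D) stand-in for the frozen model's 2D positional embeddings.
    views:     list of (3, 3) rotation matrices, one per virtual plane.
    """
    G = pos_table.shape[0]
    pe = np.zeros((points.shape[0], pos_table.shape[2]))
    for R in views:
        uv = (points @ R.T)[:, :2]               # drop depth: orthographic projection
        grid = np.clip((uv + 1.0) / 2.0 * (G - 1), 0.0, G - 1)
        x, y = grid[:, 0], grid[:, 1]
        x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
        x1, y1 = np.minimum(x0 + 1, G - 1), np.minimum(y0 + 1, G - 1)
        wx, wy = (x - x0)[:, None], (y - y0)[:, None]
        # Bilinear interpolation between the four neighbouring grid entries.
        pe += ((1 - wx) * (1 - wy) * pos_table[y0, x0]
               + wx * (1 - wy) * pos_table[y0, x1]
               + (1 - wx) * wy * pos_table[y1, x0]
               + wx * wy * pos_table[y1, x1])
    return pe / len(views)                        # average over virtual views

# Usage with random stand-in data (a real table would come from the frozen model).
rng = np.random.default_rng(0)
table = rng.normal(size=(14, 14, 8))
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))
views = [rotation_y(a) for a in (0.0, np.pi / 3, 2 * np.pi / 3)]
token_pe = virtual_projection_pe(cloud, table, views)   # shape (1024, 8)
```

The 1D case (language or audio models) would follow the same pattern under these assumptions, with points projected onto virtual lines and encodings looked up in a 1D positional table.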
