Introducing BERT into cross-modal settings makes it difficult to optimize for multiple modalities. Both the BERT architecture and the training objective need to be adapted to incorporate and model information from different modalities. In this paper, we address these challenges by exploring the implicit semantic and geometric correlations between 2D and 3D data of the same objects or scenes. We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT. To facilitate pretraining on irregular and sparse point clouds, we design two self-supervised tasks that boost cross-modal interaction. The first task, Point-Image Alignment, aligns features between unimodal and cross-modal representations to capture the correspondences between the 2D and 3D modalities. The second task, Masked Cross-modal Modeling, improves the masked modeling of BERT by incorporating high-dimensional semantic information obtained through cross-modal interaction. With this cross-modal interaction, Cross-BERT can smoothly reconstruct the masked tokens during pretraining, leading to notable performance gains on downstream tasks. Empirical evaluation shows that Cross-BERT outperforms existing state-of-the-art methods in 3D downstream applications. Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representations, as well as the transferability of BERT across modalities.
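To make the two pretraining tasks more concrete, the following is a minimal sketch in plain Python, not the Cross-BERT implementation. The abstract does not specify the loss functions, the masking rule, or the feature shapes, so everything here is an assumption chosen for illustration: `alignment_loss` uses a cosine-distance stand-in for Point-Image Alignment, `random_mask` and `masked_reconstruction_loss` use uniform random masking with a mean squared error stand-in for Masked Cross-modal Modeling, and all function names are hypothetical.

```python
import math
import random


def random_mask(num_tokens, mask_ratio, seed=0):
    """Pick token indices to hide (assumed rule: uniform random sampling)."""
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(unimodal, cross_modal):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Hypothetical stand-in for an alignment objective between unimodal
    and cross-modal features; the source does not name the actual loss.
    """
    pairs = list(zip(unimodal, cross_modal))
    return sum(1.0 - cosine_similarity(u, c) for u, c in pairs) / len(pairs)


def masked_reconstruction_loss(target, predicted, masked_idx):
    """Mean squared error computed on the masked tokens only (assumed loss)."""
    total = 0.0
    for i in masked_idx:
        total += sum((t - p) ** 2 for t, p in zip(target[i], predicted[i]))
    return total / (len(masked_idx) * len(target[0]))


if __name__ == "__main__":
    # Toy 2-dimensional "point tokens"; real features would come from encoders.
    point_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
    fused_tokens = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.1], [2.1, 0.4]]

    masked = random_mask(len(point_tokens), mask_ratio=0.5)
    print("masked indices:", masked)
    print("alignment loss:", alignment_loss(point_tokens, fused_tokens))
    print("reconstruction loss:",
          masked_reconstruction_loss(point_tokens, fused_tokens, masked))
```

In this sketch the two losses are computed independently. How Cross-BERT weights or combines its objectives is not stated in the abstract, so no combination is shown.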