We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., to improve an ImageNet model with audio or point cloud datasets. We highlight that the data samples of the target modality are irrelevant to those of the other modalities, which distinguishes our method from works that use paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways that connect components of the two models, so that data of the target modality can be processed by both. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and a task-specific head as usual, but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any additional inference cost. On image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
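The abstract does not spell out how auxiliary weights can be used at no inference cost, so the following is only a minimal sketch under a stated assumption: that a target linear layer and its auxiliary counterpart are combined additively, with a scalar `lam` weighting the auxiliary branch. Under that assumption, running both layers and summing their outputs is equivalent to running one layer with merged weights, so the auxiliary branch can be folded away after training. All names here (`matmul`, `two_branch_forward`, `merge_weights`, `lam`) are hypothetical and not taken from the M2PT code base.

```python
# Illustrative sketch only: assumes an additive combination W + lam * W_aux.
# Function and variable names are hypothetical, not from the M2PT repository.

def matmul(x, w):
    """Multiply an (n x d_in) matrix by a (d_in x d_out) matrix (nested lists)."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]

def two_branch_forward(x, w_target, w_aux, lam):
    """Training-time view: run both layers, add the scaled auxiliary output."""
    out_t = matmul(x, w_target)
    out_a = matmul(x, w_aux)
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(out_t, out_a)]

def merge_weights(w_target, w_aux, lam):
    """Inference-time view: fold the auxiliary weights into one matrix."""
    return [[t + lam * a for t, a in zip(rt, ra)] for rt, ra in zip(w_target, w_aux)]

# Toy tokens of the target modality and toy weights for the two models.
x = [[1.0, 2.0], [0.5, -1.0]]
w_target = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_aux = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
lam = 0.5  # assumed scalar weighting of the auxiliary branch

two_branch = two_branch_forward(x, w_target, w_aux, lam)
merged = matmul(x, merge_weights(w_target, w_aux, lam))

# Largest element-wise gap between the two computations.
max_diff = max(
    abs(a - b) for ra, rb in zip(two_branch, merged) for a, b in zip(ra, rb)
)
```

Because the merged matrix has the same shape as the target layer's own weights, a model deployed this way would run a single matrix multiplication per layer, which is consistent with the claim of no extra inference cost.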