We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., to improve an ImageNet model with audio or point cloud datasets. We highlight that the data samples of the target modality are irrelevant to those of the other modalities, which distinguishes our method from works that use paired (e.g., CLIP) or interleaved data of different modalities. We propose a methodology named Multimodal Pathway: given a target modality and a transformer designed for it, we use an auxiliary transformer trained with data of another modality and construct pathways connecting components of the two models, so that data of the target modality can be processed by both models. In this way, we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation, we use a modality-specific tokenizer and a task-specific head as usual, but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization, which exploits the auxiliary weights without any inference cost. On image, point cloud, video, and audio recognition tasks, we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT.
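To illustrate why exploiting auxiliary weights need not add inference cost, the following minimal sketch assumes that the re-parameterization adds a scaled copy of an auxiliary weight matrix to the corresponding target weight matrix. The abstract does not specify this form. The names `W_target`, `W_aux`, and the scale `lam` are hypothetical, and the code is not taken from the M2PT repository. Under this assumption, a linear layer that sends its input through both weight matrices is equivalent to a single layer with merged weights, because the operation is linear.

```python
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def two_branch_forward(W_target, W_aux, lam, x):
    """Two-branch view: x is processed by both weight matrices.

    Assumed form (hypothetical): y = W_target x + lam * W_aux x.
    """
    y_target = matvec(W_target, x)
    y_aux = matvec(W_aux, x)
    return [a + lam * b for a, b in zip(y_target, y_aux)]


def merge_weights(W_target, W_aux, lam):
    """Fold the auxiliary weights into the target weights: W_target + lam * W_aux."""
    return [
        [a + lam * b for a, b in zip(row_t, row_a)]
        for row_t, row_a in zip(W_target, W_aux)
    ]


# Toy data: random matrices stand in for trained weights (illustrative only).
rng = random.Random(0)
d_out, d_in = 3, 4
W_target = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
W_aux = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]
lam = 0.5  # hypothetical scale on the auxiliary branch

# Two matrix-vector products per layer in the two-branch view.
y_two_branch = two_branch_forward(W_target, W_aux, lam, x)

# After merging, a single matrix-vector product gives the same output.
W_merged = merge_weights(W_target, W_aux, lam)
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_two_branch, y_merged))
print("max abs difference:", max_diff)
```

In this sketch the merged layer has the same shape as the original target layer, so the deployed model would keep the target architecture's size and compute. Whether the scale is learned, and which layers are connected, are details the abstract leaves open.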