Pre-training has achieved tremendous success in improving model performance on a wide range of tasks, yet it has been found to perform worse than training from scratch in some uni-modal settings. This prompts the question: are pre-trained models always effective in the more complex multi-modal scenario, especially for heterogeneous modalities such as audio and vision? We find that the answer is no. Specifically, we explore the effects of pre-trained models in two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. Under cross-modal initialization, the phenomenon of "dead channels" caused by abnormal Batchnorm parameters hinders the utilization of model capacity. We therefore propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find that a strong pre-trained uni-modal encoder can negatively affect the encoder of the other modality. To alleviate this problem, we introduce a two-stage Fusion Tuning strategy, which takes better advantage of the pre-trained knowledge while making the uni-modal encoders cooperate through an adaptive masking method. Experimental results show that our methods further exploit the potential of pre-trained models and boost performance in audio-visual learning.
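To make the notion of a "dead channel" concrete, the following is a minimal illustrative sketch and not the ABRi procedure itself, whose details are not given in this abstract. It assumes that a channel counts as dead when the magnitude of its Batchnorm scale is close to zero, so that its output collapses to the constant shift regardless of the input. The function names, the threshold `1e-3`, and the reset to the default values (scale 1, shift 0) are all hypothetical choices made for illustration.

```python
import math


def batchnorm_channel(xs, gamma, beta, eps=1e-5):
    """Apply training-mode batch normalization to one channel's activations."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


def find_dead_channels(gammas, threshold=1e-3):
    """Return indices of channels whose |gamma| is below the (hypothetical) threshold."""
    return [i for i, g in enumerate(gammas) if abs(g) < threshold]


def reinit_dead_channels(gammas, betas, threshold=1e-3):
    """Reset flagged channels to gamma=1, beta=0.

    This is a simple stand-in for illustration only; it is NOT the
    adaptive re-initialization (ABRi) proposed in the paper.
    """
    dead = set(find_dead_channels(gammas, threshold))
    new_gammas = [1.0 if i in dead else g for i, g in enumerate(gammas)]
    new_betas = [0.0 if i in dead else b for i, b in enumerate(betas)]
    return new_gammas, new_betas


# Example: the second channel has a near-zero scale, so it is flagged.
gammas = [0.8, 1e-6, 1.2]
betas = [0.1, -0.5, 0.3]
dead = find_dead_channels(gammas)
new_gammas, new_betas = reinit_dead_channels(gammas, betas)

# With gamma = 0 the channel output equals beta for every input value.
collapsed = batchnorm_channel([1.0, 2.0, 3.0], gamma=0.0, beta=-0.5)
```

With a zero scale, the channel emits the same value for every input, and a non-positive shift followed by a ReLU would zero it out entirely. This is one way in which abnormal Batchnorm parameters could leave part of the model capacity unused.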