Imbalanced multimodal learning arises when a model favors the training of certain modalities over others. To address it, existing methods control the training of uni-modal encoders from different perspectives, taking the inter-modal performance discrepancy as the basis. However, they ignore the intrinsic limitation of modality capacity. Scarcely informative modalities can be mistaken for ``worse-learnt'' ones, and emphasizing their training could force the model to memorize more noise, which counterproductively harms the ability of the multimodal model. Moreover, current modality modulation methods concentrate narrowly on the selected worse-learnt modalities and may even suppress the training of the others. Hence, it is essential to consider the intrinsic limitation of modality capacity and to take all modalities into account during balancing. To this end, we propose the Diagnosing \& Re-learning method. The learning state of each modality is first estimated from the separability of its uni-modal representation space and then used to softly re-initialize the corresponding uni-modal encoder. In this way, over-emphasis on scarcely informative modalities is avoided. In addition, the encoders of worse-learnt modalities are enhanced while over-training of the other modalities is avoided. Accordingly, multimodal learning is effectively balanced and enhanced. Experiments covering multiple types of modalities and multimodal frameworks demonstrate the superior performance of our simple-yet-effective method for balanced multimodal learning. The source code and dataset are available at \url{https://github.com/GeWu-Lab/Diagnosing_Relearning_ECCV2024}.
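To make the diagnose-then-re-learn idea concrete, the following is a minimal sketch under stated assumptions, not the released implementation. The abstract specifies neither the separability measure nor the re-initialization rule, so three choices here are illustrative. First, separability is approximated by nearest-class-centroid agreement. Second, the learning state is taken as the gap between training and validation separability. Third, that gap is mapped to a strength `alpha` through `tanh` and used to interpolate the current encoder weights toward their initial values. All function names and the `scale` parameter are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(features, labels):
    """Mean feature vector per class."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in groups.items()}


def centroid_separability(features, labels, centroids):
    """Fraction of samples whose nearest centroid matches their label.

    Assumption: a simple stand-in for the separability of a uni-modal
    representation space; the actual method may use a different measure.
    """
    correct = 0
    for x, y in zip(features, labels):
        nearest = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += int(nearest == y)
    return correct / len(features)


def reinit_strength(train_sep, val_sep, scale=1.0):
    """Map the train/validation separability gap to a strength in [0, 1).

    Assumption: a larger gap is read as a sign that the encoder has fitted
    the training set beyond what generalizes, so it is reset more strongly.
    """
    gap = max(0.0, train_sep - val_sep)
    return math.tanh(scale * gap)


def soft_reinit(current, initial, alpha):
    """Interpolate current weights toward their initial values.

    alpha = 0 keeps the encoder unchanged; alpha = 1 is a full reset.
    """
    return [(1.0 - alpha) * c + alpha * i for c, i in zip(current, initial)]


# Toy uni-modal features for one modality (hypothetical data).
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
train_y = [0, 0, 1, 1]
val_x = [(0.0, 0.1), (0.8, 0.9), (0.9, 0.9), (0.1, 0.2)]
val_y = [0, 0, 1, 1]

centroids = class_centroids(train_x, train_y)
train_sep = centroid_separability(train_x, train_y, centroids)
val_sep = centroid_separability(val_x, val_y, centroids)
alpha = reinit_strength(train_sep, val_sep)

# Flattened encoder weights, current versus initial (hypothetical values).
new_weights = soft_reinit([1.0, -2.0], [0.0, 0.0], alpha)
```

In this sketch every modality gets its own `alpha`, so no modality is singled out or suppressed: a modality whose representation separates equally well on training and validation data is left almost untouched, while one with a large gap is pulled back toward its initialization.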