Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.
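To make the Mask Token idea concrete, here is a minimal illustrative sketch of one plausible reading: a single learnable token per modality acts as a query that attends over the features of the available modalities to produce a stand-in feature for the missing one. This is an assumption-laden toy, not the paper's actual implementation; the modality names, dimension, and the specific cross-attention form are all hypothetical, and the learnable parameters are shown as plain arrays rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical shared feature dimension

# One learnable mask token per modality (hypothetical; randomly
# initialized here in place of trained parameters).
mask_tokens = {"text": rng.normal(size=D)}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def estimate_missing(missing, available_feats):
    """Estimate a missing modality's feature: the missing modality's
    mask token queries the available modalities' features via a
    single scaled-dot-product attention step (an assumed form)."""
    q = mask_tokens[missing]                      # query: the mask token
    K = np.stack(list(available_feats.values()))  # keys/values: available feats
    attn = softmax(K @ q / np.sqrt(D))            # attention over modalities
    return attn @ K                               # weighted sum as the estimate

# Example: text is missing at test time; estimate it from audio + video.
feats = {"audio": rng.normal(size=D), "video": rng.normal(size=D)}
text_hat = estimate_missing("text", feats)
```

The appeal of such a scheme, as the abstract notes, is that one token per modality suffices: no separate feature-estimation network or prompt-tuning stage is needed, and the same forward pass handles any subset of missing modalities.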