Recent advancements in large-scale models have showcased remarkable generalization capabilities across a variety of tasks. However, integrating multimodal processing into these models remains a significant challenge, as it often incurs a high computational burden. To address this challenge, we introduce a new parameter-efficient multimodal tuning strategy for large models, referred to as Multimodal Infusion Tuning (MiT). MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention-head level, which optimizes the representation of the infused multimodal features. Notably, all foundation models are kept frozen during tuning to reduce the computational burden, and only 2.5\% of the parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks such as referring segmentation and non-image tasks such as sentiment analysis. Our results show that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (10\% of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
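To make the described mechanism concrete, the following is a minimal sketch of how multimodal features might be infused into a frozen self-attention block with head-level adaptive rescaling. It assumes a decoupled formulation in which text-to-text attention uses the frozen pretrained projections while a separate, tunable pathway attends to projected modality features; the module and parameter names (InfusedAttention, infusion_proj, head_scale) are illustrative assumptions, not the authors' implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class InfusedAttention(nn.Module):
    """Sketch: frozen self-attention with a tunable multimodal pathway."""

    def __init__(self, d_model: int, n_heads: int, d_modal: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Frozen pretrained projections of the language model.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        for p in list(self.qkv.parameters()) + list(self.out.parameters()):
            p.requires_grad = False
        # Tunable parts: project modality features into key/value space
        # and rescale the infused contribution separately for each head.
        self.infusion_proj = nn.Linear(d_modal, 2 * d_model)
        self.head_scale = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x, modal_feats):
        # x: (B, T, d_model) text tokens; modal_feats: (B, M, d_modal)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        mk, mv = self.infusion_proj(modal_feats).chunk(2, dim=-1)

        def split(t):  # (B, L, d_model) -> (B, n_heads, L, d_head)
            return t.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v, mk, mv = map(split, (q, k, v, mk, mv))
        scale = self.d_head ** -0.5
        # Decoupled attention: text-to-text (frozen pathway) ...
        txt = torch.softmax(q @ k.transpose(-2, -1) * scale, -1) @ v
        # ... and text-to-modality (tunable pathway), rescaled per head.
        mod = torch.softmax(q @ mk.transpose(-2, -1) * scale, -1) @ mv
        gate = self.head_scale.tanh().view(1, -1, 1, 1)
        out = (txt + gate * mod).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
\end{verbatim}

Under this reading, only infusion_proj and head_scale are updated during tuning, which is consistent with the small tunable-parameter budget reported above; the zero-initialized per-head scale starts the model at its frozen unimodal behavior and lets each head learn how strongly to admit the infused features.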