Recent advancements in large-scale models have showcased remarkable generalization capabilities across various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high computational burden. To address this challenge, we introduce Multimodal Infusion Tuning (MiT), a new parameter-efficient multimodal tuning strategy for large models. MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention head level, which optimizes the representation of infused multimodal features. Notably, all foundation models are kept frozen during the tuning process to reduce the computational burden, and only 2.5\% of the parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks such as referring segmentation and non-image tasks such as sentiment analysis. Our results show that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (10\% of that of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
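To make the idea of infusing a second modality through a decoupled attention path more concrete, the following is a minimal sketch for a single attention head. It is an illustration under stated assumptions, not the formulation used in MiT: the abstract does not give the equations, so the additive combination, the `tanh` gate, and the names `attend`, `infused_head`, and `gate` are all hypothetical. The sketch assumes that text queries attend to text keys and values as in the frozen model, attend separately to modality keys and values, and that a per-head scalar rescales the second result before it is added.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for one head, on plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        )
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        )
    return out


def infused_head(text_q, text_k, text_v, modal_k, modal_v, gate):
    """Hypothetical infusion for one head (an assumption, not the paper's rule).

    base:    the frozen text-to-text attention output
    infused: the same queries attending to modality keys/values
    gate:    a per-head scalar, standing in for the only tunable quantity
    """
    base = attend(text_q, text_k, text_v)
    infused = attend(text_q, modal_k, modal_v)
    g = math.tanh(gate)  # assumed rescaling; gate = 0 leaves the base output unchanged
    return [
        [b + g * i for b, i in zip(base_row, infused_row)]
        for base_row, infused_row in zip(base, infused)
    ]
```

One consequence of this particular design choice is that a gate initialized to zero reproduces the frozen model's output exactly, so tuning would start from the pretrained behavior and learn, head by head, how much multimodal signal to admit.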