In this work, we undertake the challenge of augmenting the existing generative capabilities of pre-trained text-only large language models (LLMs) with multi-modal generation capability while satisfying two core constraints: C1 preserving the original language generative capabilities with negligible performance degradation, and C2 adhering to a small parameter budget for learning the new modality, ensuring scalability and efficiency. In contrast to current approaches that add dedicated modules, thereby significantly increasing the parameter count, we propose a method that leverages the underutilized capacity inherent in deep models. Specifically, we exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source of additional capacity for learning a new modality, enabling better parameter efficiency (C2). Moreover, we preserve the original language generation capabilities by applying low-rank adaptation exclusively to the tokens of the new modality (C1). Furthermore, we introduce a novel parameter initialization scheme based on the Gromov-Wasserstein distance to improve convergence and training stability. Through an extensive analysis of the routing mechanism, we uncover the emergence of modality-specific pathways and decreased redundancy within the experts, which together efficiently unlock multi-modal generative capabilities. Overall, our method can be seamlessly applied to a wide range of contemporary LLMs, providing a new pathway for transitioning from uni-modal to multi-modal architectures.
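To make the token-selective adaptation concrete, the following is a minimal PyTorch-style sketch of applying a low-rank update only to tokens of the new modality while frozen pre-trained weights serve text tokens unchanged. The class name ModalityAwareLoRALinear, the rank and alpha values, and the is_new_modality mask are hypothetical illustrations under these assumptions, not the paper's actual implementation.

# Minimal sketch (not the paper's implementation): a low-rank adapter that is
# applied only to tokens of the new modality, leaving text tokens on the frozen
# pre-trained path. ModalityAwareLoRALinear and the is_new_modality mask are
# hypothetical names used for illustration.
import torch
import torch.nn as nn


class ModalityAwareLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pre-trained weights (C1)
        # Small trainable adapter: parameter count grows only with the rank (C2).
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # the adapter starts as a zero delta
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, is_new_modality: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, in_features); is_new_modality: (num_tokens,) boolean mask
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x)) * self.scale
        # Text tokens bypass the adapter, so the original language path is untouched.
        return out + delta * is_new_modality.unsqueeze(-1).to(out.dtype)


if __name__ == "__main__":
    layer = ModalityAwareLoRALinear(nn.Linear(64, 64), rank=4)
    tokens = torch.randn(10, 64)
    mask = torch.tensor([False] * 5 + [True] * 5)   # last 5 tokens: new modality
    print(layer(tokens, mask).shape)                # torch.Size([10, 64])

In this sketch the gating by is_new_modality is what enforces C1: text tokens receive exactly the frozen pre-trained computation, while only new-modality tokens see the low-rank delta.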