Hyperscaling of data and parameter counts in LLMs is yielding diminishing improvements relative to training costs, underlining a growing need for more efficient finetuning and inference without sacrificing performance. This is especially true for multimodal language models (MLMs), where the overhead of processing multimodal tokens can limit their practical viability. In parallel, recent work has uncovered implicit cross-modal alignment in the deeper layers of large MLMs, deepening our understanding of how MLMs process and encode information. Motivated by this, and by our observation that MLMs naturally defer most cross-modal token interactions to their deeper layers, we propose a simple modification: instead of concatenating multimodal tokens with the language prompt at the input, we insert them directly at an intermediate layer, allowing them to bypass the early layers entirely. Our results across diverse modalities, (i) LLaVA \& BLIP for vision, (ii) LTU for audio, and (iii) MoLCA for molecular data, and model sizes ranging from 350M to 13B parameters, indicate that our method reduces both training and inference costs while at least preserving, if not surpassing, the performance of existing baselines.
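To make the architectural change concrete, the following is a minimal sketch, not the authors' implementation: a toy transformer stack in which multimodal tokens are injected at an intermediate layer rather than concatenated with the text prompt at the input. The class name `MidInsertionLM`, the `inject_layer` parameter, and the use of generic encoder layers (rather than a causal decoder with a pretrained projector) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MidInsertionLM(nn.Module):
    """Toy stack: text tokens pass through all layers; multimodal tokens
    are inserted at `inject_layer`, bypassing the earlier layers."""

    def __init__(self, d_model=64, n_layers=8, n_heads=4, inject_layer=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.inject_layer = inject_layer  # multimodal tokens skip layers [0, inject_layer)

    def forward(self, text_tokens, mm_tokens):
        h = text_tokens
        for i, layer in enumerate(self.layers):
            if i == self.inject_layer:
                # Insert (already projected) multimodal tokens mid-stack
                # instead of concatenating them before layer 0.
                h = torch.cat([mm_tokens, h], dim=1)
            h = layer(h)
        return h

# Usage: text runs through all 8 layers, multimodal tokens only through the last 4.
model = MidInsertionLM()
text = torch.randn(1, 16, 64)   # 16 text tokens
mm = torch.randn(1, 32, 64)     # 32 multimodal (e.g., image) tokens
out = model(text, mm)
print(out.shape)  # torch.Size([1, 48, 64])
```

Because the multimodal tokens never enter the first `inject_layer` layers, attention and feed-forward computation over those (typically numerous) tokens is skipped there, which is the source of the training- and inference-cost reduction described above.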