Large Language Models (LLMs) can reason over diverse input data modalities through pre-trained encoders. However, the growing diversity of input modalities makes it impractical to incorporate all of them into an LLM, especially when the LLM is deployed on resource-constrained edge devices for embodied AI applications. A better option is to adaptively involve only the useful modalities at runtime, depending on the current environmental context and task requirements. For such modality adaptation, existing work adopts fixed connections between encoders and the LLM's input layer, leading to high training cost at runtime and ineffective cross-modal interaction. In this paper, we address these limitations by presenting mPnP-LLM, a new technique that allows fully elastic, automated and prompt runtime modality adaptation by connecting unimodal encoders to a flexible set of the last LLM blocks and making these latent connections fully trainable at runtime. Experiments on the nuScenes-QA dataset show that mPnP-LLM achieves up to 3.7x FLOPs reduction and 30% GPU memory usage reduction while retaining accuracy on par with existing schemes. Under the same compute budget, mPnP-LLM improves task accuracy by up to 4% compared to the best existing scheme.
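The idea of connecting an encoder to the last LLM blocks through trainable latent connections can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the class name `LatentConnections`, the softmax weighting over per-block logits, and the choice to append scaled modality tokens to a block's input are all hypothetical. The sketch only shows why attaching to the last k blocks limits how far gradients have to travel during runtime training.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class LatentConnections:
    """Hypothetical trainable connections from one encoder to the last k LLM blocks.

    Assumption: each connection carries one trainable logit, and a softmax
    over the logits gives the per-block mixing weight.
    """

    def __init__(self, num_blocks, k):
        if not 0 < k <= num_blocks:
            raise ValueError("k must be between 1 and num_blocks")
        self.num_blocks = num_blocks
        # Indices of the last k blocks, e.g. 28..31 for a 32-block model.
        self.block_ids = list(range(num_blocks - k, num_blocks))
        # Trainable parameters; zeros give uniform weights at initialization.
        self.logits = [0.0] * k

    def weights(self):
        """Map each connected block index to its current mixing weight."""
        return dict(zip(self.block_ids, softmax(self.logits)))

    def inject(self, block_id, hidden_tokens, modality_tokens):
        """Return the token sequence a block sees after injection.

        Unconnected blocks see only the LLM's own hidden tokens. Connected
        blocks additionally see the modality tokens scaled by their weight.
        """
        w = self.weights()
        if block_id not in w:
            return list(hidden_tokens)
        scaled = [[w[block_id] * v for v in tok] for tok in modality_tokens]
        return list(hidden_tokens) + scaled

    def blocks_in_backward_pass(self):
        """Blocks that gradients must traverse to reach every connection."""
        return self.num_blocks - self.block_ids[0]


conn = LatentConnections(num_blocks=32, k=4)
print(conn.block_ids)                  # indices of the connected blocks
print(conn.blocks_in_backward_pass())  # blocks touched by backpropagation
```

In this sketch, a connection at the input layer would require backpropagation through all 32 blocks to update the encoder-side parameters, whereas connections to the last 4 blocks require traversing only those 4. Changing `k` at runtime changes that trade-off between training cost and how deeply the modality interacts with the LLM.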