Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low-rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while tuning only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, achieving 20% lower EER and 56% lower false accept rate than FFT. The proposed approach scales well for model sizes from 16M to 3B parameters.
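To illustrate the mechanism named above (not the paper's actual implementation), the sketch below shows a linear layer with a frozen pretrained weight, a trainable low-rank adapter, and an adapter-dropout switch that occasionally drops the whole adapter during training so the model learns to tolerate a missing modality stream. All class and parameter names here (`LoRALinear`, `rank`, `alpha`, `adapter_dropout`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen dense layer plus a trainable low-rank adapter (illustrative sketch)."""

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, adapter_dropout=0.5):
        # Frozen pretrained weight; only A and B would be trained.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # down-projection
        self.B = np.zeros((d_out, rank))                   # up-projection, zero-init
        self.scale = alpha / rank
        self.p = adapter_dropout

    def forward(self, x, training=False):
        y = self.W @ x
        # Adapter dropout: with probability p, skip the adapter entirely during
        # training, mimicking a missing modality at inference time.
        if training and rng.random() < self.p:
            return y
        return y + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(d_in=16, d_out=8)
x = rng.standard_normal(16)
out = layer.forward(x, training=False)
print(out.shape)  # (8,)
```

Because `B` is zero-initialized, the adapted layer initially reproduces the frozen layer's output exactly, so adaptation starts from the pretrained behavior.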