As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demand for more versatile and efficient AI. However, previous omni-models have explored speech insufficiently, neglecting its integration with the other modalities. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and the other modalities, thereby improving model performance; and (3) constructing a high-quality, extensive dataset of 1.5M multi-modal (language, vision, audio) samples and 12K long-speech samples, enabling Lyra to handle complex long-speech inputs and achieve more robust omni-cognition. Compared with other omni-models, Lyra achieves state-of-the-art performance on a range of vision-language, vision-speech, and speech-language benchmarks while using fewer computational resources and less training data.
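To make the first strategy concrete, below is a minimal sketch of what a "multi-modality LoRA" could look like: a frozen base projection shared across modalities, with a small trainable low-rank adapter selected per modality. All names here (`MultiModalityLoRA`, the rank, the modality keys) are illustrative assumptions for exposition, not Lyra's actual implementation.

```python
# Hypothetical sketch: frozen shared weights + per-modality low-rank adapters.
# This illustrates the general LoRA-per-modality idea, not Lyra's real code.
import torch
import torch.nn as nn


class MultiModalityLoRA(nn.Module):
    """Frozen linear layer plus one trainable low-rank (A @ B) adapter per modality."""

    def __init__(self, dim: int, rank: int = 8, modalities=("vision", "speech")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # base model stays frozen
        self.base.bias.requires_grad_(False)
        # One small adapter pair per modality; B starts at zero so the
        # adapted layer initially matches the frozen base exactly.
        self.lora_a = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(dim, rank) * 0.01) for m in modalities}
        )
        self.lora_b = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(rank, dim)) for m in modalities}
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Frozen path plus the modality-specific low-rank update.
        return self.base(x) + x @ self.lora_a[modality] @ self.lora_b[modality]


# Usage: tokens from each modality share the frozen base weights; only the
# small adapters are trained, which is cheap in both parameters and data.
layer = MultiModalityLoRA(dim=64)
speech_tokens = torch.randn(2, 10, 64)
out = layer(speech_tokens, modality="speech")
print(out.shape)  # torch.Size([2, 10, 64])
```

The design choice this illustrates is the cost claim in strategy (1): only the rank-limited adapter matrices receive gradients, so per-modality specialization adds a small fraction of the base layer's parameters.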