As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demand for more versatile and efficient AI. However, previous omni-models have explored speech insufficiently, neglecting its integration with the other modalities. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and the other modalities, thereby improving model performance; and (3) constructing a high-quality, extensive dataset of 1.5M multi-modal (language, vision, audio) samples and 12K long-speech samples, enabling Lyra to handle complex long-speech inputs and achieve more robust omni-cognition. Compared with other omni-models, Lyra achieves state-of-the-art performance on a range of vision-language, vision-speech, and speech-language benchmarks while using fewer computational resources and less training data.
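To make the first strategy concrete, below is a minimal sketch of what a "multi-modality LoRA" could look like: a frozen base projection shared across modalities, with a small trainable low-rank adapter selected per modality. All names here (`MultiModalityLoRA`, the rank, the modality keys) are illustrative assumptions for exposition, not Lyra's actual implementation.

```python
# Hypothetical sketch: frozen shared weights + per-modality low-rank adapters.
# This illustrates the general LoRA-per-modality idea, not Lyra's real code.
import torch
import torch.nn as nn


class MultiModalityLoRA(nn.Module):
    """Frozen linear layer plus one trainable low-rank (A @ B) adapter per modality."""

    def __init__(self, dim: int, rank: int = 8, modalities=("vision", "speech")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # base model stays frozen
        self.base.bias.requires_grad_(False)
        # One small adapter pair per modality; B starts at zero so the
        # adapted layer initially matches the frozen base exactly.
        self.lora_a = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(dim, rank) * 0.01) for m in modalities}
        )
        self.lora_b = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(rank, dim)) for m in modalities}
        )

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Frozen path plus the modality-specific low-rank update.
        return self.base(x) + x @ self.lora_a[modality] @ self.lora_b[modality]


# Usage: tokens from each modality share the frozen base weights; only the
# small adapters are trained, which is cheap in both parameters and data.
layer = MultiModalityLoRA(dim=64)
speech_tokens = torch.randn(2, 10, 64)
out = layer(speech_tokens, modality="speech")
print(out.shape)  # torch.Size([2, 10, 64])
```

The design choice this illustrates is the cost claim in strategy (1): only the rank-limited adapter matrices receive gradients, so per-modality specialization adds a small fraction of the base layer's parameters.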