Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, which limits their generalization and accessibility. This paper introduces MiDashengLM, an open audio-language model designed for efficient and comprehensive audio understanding, trained on general audio captions from our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on audio-text alignment based on Automatic Speech Recognition (ASR), our strategy centers on general audio captions that fuse speech, sound, and music information into a single holistic textual representation of complex audio scenes. Lastly, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
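To make the two efficiency metrics concrete, the following is a minimal sketch of how TTFT and throughput could be measured for a single streamed generation. It is not the evaluation code used for MiDashengLM: the helper names `measure_stream` and `speedup` are hypothetical, the token stream is a stand-in for any model's streaming output, and throughput is assumed here to mean generated tokens per second, which may differ from the definition used in the reported comparison.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed generation.

    `tokens` is any iterable that yields tokens as they are produced
    (a hypothetical stand-in for a model's streaming interface).
    """
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in tokens:
        # Record the arrival time of every token as it is yielded.
        last_token_at = clock()
        if first_token_at is None:
            first_token_at = last_token_at
        count += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    elapsed = last_token_at - start
    # Assumed definition: tokens generated per second of wall-clock time.
    throughput = count / elapsed if elapsed > 0 else float("inf")
    return ttft, throughput


def speedup(baseline_ttft: float, candidate_ttft: float) -> float:
    """Ratio by which the candidate's TTFT improves on the baseline's."""
    return baseline_ttft / candidate_ttft
```

Under this definition, a "4x speedup in TTFT" corresponds to `speedup(baseline_ttft, candidate_ttft) == 4.0`, that is, the candidate emits its first token in a quarter of the baseline's time. The `clock` parameter exists only so that the measurement can be driven by a deterministic clock in tests.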