MiDashengLM: Efficient Audio Understanding with General Audio Captions

Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding, trained on general audio captions from our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions that fuse speech, sound, and music information into a single holistic textual representation of complex audio scenes. Lastly, MiDashengLM provides up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
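To make the two efficiency metrics named in the abstract concrete, the following is a minimal sketch of how TTFT and throughput could be computed from per-token timestamps. It is not the authors' benchmarking code: the helper names (`latency_stats`, `speedup`), the single-request tokens-per-second definition of throughput, and the example timings are all assumptions chosen for illustration, and the paper may define and measure throughput differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyStats:
    ttft_s: float        # time-to-first-token, in seconds
    tokens_per_s: float  # generated tokens per second for one request


def latency_stats(request_start: float, token_times: Sequence[float]) -> LatencyStats:
    """Compute TTFT and throughput from wall-clock timestamps.

    request_start: time the request was submitted (hypothetical input).
    token_times:   time each generated token became available, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Assumed definition: tokens generated divided by total elapsed time.
    tokens_per_s = len(token_times) / total if total > 0 else float("inf")
    return LatencyStats(ttft_s=ttft, tokens_per_s=tokens_per_s)


def speedup(baseline: float, candidate: float) -> float:
    """Ratio of a baseline latency to a candidate latency (higher is better)."""
    if candidate <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline / candidate


if __name__ == "__main__":
    # Illustrative timings only; these are not measurements from the paper.
    baseline = latency_stats(0.0, [2.0, 2.5, 3.0, 4.0])
    candidate = latency_stats(0.0, [0.5, 1.0, 1.5, 2.0])
    print(f"TTFT speedup: {speedup(baseline.ttft_s, candidate.ttft_s):.1f}x")
    print(f"Throughput ratio: {candidate.tokens_per_s / baseline.tokens_per_s:.1f}x")
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()` wrapped around a streaming generation loop; a batched throughput figure would additionally aggregate over concurrent requests.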


