Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Trained on a remarkably small amount of public audio data (less than 30K hours, of which 5K are unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, scoring 64.14 on par with R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.