Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Trained on a remarkably small amount of public audio data, fewer than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark with a score of 64.14, on par with R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities, such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors, are not required for strong performance, even compared to models trained on over 500K hours of data.
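The abstract does not spell out the connector design, but a minimal sketch of the simple recipe it implies (a Whisper encoder feeding a lightweight MLP projector into an instruction-tuned LLM, rather than an intricate cross-attention connector) might look like the following. The checkpoint names, projector shape, and class name here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel


class FalconAudioSketch(nn.Module):
    """Hypothetical sketch of the ALM layout the abstract suggests:
    Whisper encoder -> small MLP connector -> instruction-tuned LLM."""

    def __init__(self,
                 whisper_name: str = "openai/whisper-small",     # assumed encoder
                 llm_name: str = "tiiuae/Falcon3-1B-Instruct"):  # assumed LLM
        super().__init__()
        self.audio_encoder = WhisperModel.from_pretrained(whisper_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        d_audio = self.audio_encoder.config.d_model
        d_llm = self.llm.config.hidden_size
        # A plain MLP projector in place of a cross-attention connector,
        # per the ablation finding summarized in the abstract.
        self.connector = nn.Sequential(
            nn.Linear(d_audio, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, input_features: torch.Tensor, input_ids: torch.Tensor):
        # input_features: log-mel spectrogram, e.g. [batch, 80, 3000] for Whisper.
        audio_states = self.audio_encoder(input_features).last_hidden_state
        audio_embeds = self.connector(audio_states)               # [B, T_audio, d_llm]
        text_embeds = self.llm.get_input_embeddings()(input_ids)  # [B, T_text, d_llm]
        # Prepend projected audio tokens to the text prompt and decode as usual.
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Under these assumptions, the whole model can be trained end to end in a single stage on audio-instruction pairs, which is consistent with the single-stage training the abstract highlights.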