Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks, but their fine-tuning remains underexplored, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes, including text-only fine-tuning, direct mixing, and curriculum learning, affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even a small amount of speech data (2-5%) yields substantial further gains, with curriculum learning proving particularly effective when data are scarce. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into LALM fine-tuning under realistic data constraints.
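As a rough illustration of the curriculum scheme described above, the sketch below stages the training data: a text-only warm-up followed by a mixed stage that folds in a small sample of paired speech-label data. This is a minimal sketch under stated assumptions, not the paper's actual recipe: the function name `build_curriculum`, the `speech_fraction` parameter, and the choice to size the speech sample relative to the text set are all hypothetical, and the paper's staging and mixing ratios may differ.

```python
import random

def build_curriculum(text_pairs, speech_pairs, speech_fraction=0.05, seed=0):
    """Build a two-stage curriculum: a text-only warm-up stage, then a
    stage that mixes in a small amount of paired speech-label data.
    (Hypothetical helper; not from the paper.)"""
    rng = random.Random(seed)
    # Size the speech sample relative to the text set -- one plausible
    # reading of the abstract's "2-5%"; the paper may define it differently.
    n_speech = max(1, int(speech_fraction * len(text_pairs)))
    speech_sample = rng.sample(speech_pairs, min(n_speech, len(speech_pairs)))
    stage1 = list(text_pairs)                  # stage 1: text-only fine-tuning
    stage2 = list(text_pairs) + speech_sample  # stage 2: text + small speech mix
    rng.shuffle(stage2)
    return [stage1, stage2]

# Toy usage: text-label pairs are abundant, speech-label pairs are scarce.
text = [(f"utterance {i}", f"label_{i % 3}") for i in range(1000)]
speech = [(f"clip_{i:03d}.wav", f"label_{i % 3}") for i in range(60)]
stages = build_curriculum(text, speech, speech_fraction=0.05)
print(len(stages[0]), len(stages[1]))  # 1000, 1050
```

Under this sketch, direct mixing would correspond to training only on `stage2`, while text-only fine-tuning uses only `stage1`; the curriculum runs the two stages in sequence.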