Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning. The result is a collection of customised checkpoints for which repeating full fine-tuning whenever new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR, benchmarking 11 merging algorithms across 10 European Portuguese domains and evaluating in-domain accuracy, robustness under distribution shift, and English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
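To make the core idea concrete, the sketch below illustrates SVD-based task-vector merging in the general style of TSV-M: each task vector (fine-tuned weights minus base weights) is compressed to a low-rank approximation via SVD before averaging. This is a minimal illustration, not the paper's actual BoostedTSV-M; in particular, the `boost_eps` flooring of small singular values is only an assumed, hypothetical stand-in for the singular-value boosting described above, and the real TSV-M includes additional steps (e.g. cross-task decorrelation of singular vectors) omitted here.

```python
import numpy as np

def merge_task_vectors_svd(base, finetuned_weights, rank, boost_eps=1e-3):
    """Illustrative merge of several fine-tuned weight matrices into one.

    For each task: form the task vector (delta from base), take its
    truncated SVD, floor tiny singular values (a hypothetical proxy for
    'singular-value boosting' against rank collapse), and average the
    compressed task vectors back onto the base weights.
    """
    merged_delta = np.zeros_like(base)
    for W in finetuned_weights:
        delta = W - base                                  # task vector
        U, s, Vt = np.linalg.svd(delta, full_matrices=False)
        s_k = s[:rank].copy()
        if s_k.max() > 0:
            # floor near-zero singular values relative to the largest one
            s_k = np.maximum(s_k, boost_eps * s_k.max())
        merged_delta += (U[:, :rank] * s_k) @ Vt[:rank]   # rank-k reconstruction
    return base + merged_delta / len(finetuned_weights)
```

A per-layer loop over a real checkpoint would apply this matrix-by-matrix; 2-D weight matrices are assumed here for simplicity.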