Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-task training approaches aim to address this by jointly optimizing multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still suffer from high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank approximation and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language and task interference, and enhancing extensibility. Experimental results across a range of languages demonstrate that LoRS-Merging reduces the word error rate by 10% and improves BLEU scores by 4% compared to conventional multi-lingual multi-task training baselines. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.
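To make the idea concrete, here is a minimal sketch of the merging recipe the abstract describes, under common task-vector merging assumptions: each language- or task-specific model is reduced to its delta from a shared base, compressed by SVD-based low-rank truncation and magnitude pruning, and the compressed deltas are averaged back onto the base. The function names (`lors_merge`, `low_rank_approx`, `sparse_prune`) and the rank/sparsity settings are illustrative; the paper's exact procedure and hyperparameters are not specified here.

```python
import numpy as np

def low_rank_approx(delta, rank):
    """Keep only the top-`rank` singular components of a weight delta."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

def sparse_prune(delta, keep_ratio):
    """Zero out all but the largest-magnitude `keep_ratio` fraction of entries."""
    k = max(1, int(delta.size * keep_ratio))
    threshold = np.partition(np.abs(delta).ravel(), -k)[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def lors_merge(base, finetuned_weights, rank=8, keep_ratio=0.1):
    """Merge language/task-specific weights into a single set of weights.

    Each fine-tuned model contributes a task vector (its delta from the
    base), which is compressed by low-rank approximation plus sparse
    pruning before the compressed deltas are averaged onto the base.
    """
    merged_delta = np.zeros_like(base)
    for w in finetuned_weights:
        delta = w - base
        delta = low_rank_approx(delta, rank)
        delta = sparse_prune(delta, keep_ratio)
        merged_delta += delta
    return base + merged_delta / len(finetuned_weights)

# Toy usage: merge two hypothetical "language-specific" weight matrices.
rng = np.random.default_rng(0)
base = rng.standard_normal((64, 64))
models = [base + 0.01 * rng.standard_normal((64, 64)) for _ in range(2)]
merged = lors_merge(base, models, rank=8, keep_ratio=0.1)
print(merged.shape)
```

In this reading, the low-rank step preserves the dominant structure of each task vector while the sparsity step discards small redundant updates, which is what lets the merged deltas coexist with reduced language and task interference.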