Recent advances in large language models (LLMs) have spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding the language pairs of an existing instruction-tuned ST system is costly, because it requires re-training on a combination of the new and previous datasets. We propose to add new language pairs by merging a model trained on the new pairs with the existing model via task arithmetic. We find that directly applying task arithmetic to ST causes the merged model to fail to follow instructions and thus generate translations in incorrect languages. To eliminate this language confusion, we propose an augmented task arithmetic method that additionally merges a language control model, which is trained to generate the correct target-language token following the instructions. Our experiments demonstrate that the proposed language control model achieves language expansion by eliminating language confusion, improving BLEU by up to 4.66 on MuST-C and 4.92 on CoVoST-2. In addition, we show that our task arithmetic framework can extend to language pairs for which neither paired ST training data nor a pre-trained ST model is available: we first synthesize an ST system from machine translation (MT) systems via task analogy, then merge the synthesized system into the existing ST model.
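The merging operations the abstract refers to can be sketched in a few lines. This is a minimal illustration of generic task arithmetic on toy parameter dictionaries, not the paper's actual checkpoints or merging coefficients: each fine-tuned model contributes a "task vector" (its parameter-wise difference from the shared base), and scaled task vectors are summed onto the base; task analogy composes vectors of related tasks (here, approximating a new ST model from MT models and an existing ST model).

```python
import numpy as np

def task_vector(finetuned, base):
    # Task vector: parameter-wise difference between a fine-tuned model
    # and the shared base it was initialized from.
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, task_vectors, scale=1.0):
    # Task arithmetic: add scaled task vectors onto the base parameters.
    merged = {k: v.copy() for k, v in base.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += scale * tv[k]
    return merged

# Toy 3-parameter "models" (names are illustrative only).
base      = {"w": np.zeros(3)}
st_old    = {"w": np.array([1.0, 0.0, 0.0])}  # existing ST model
st_new    = {"w": np.array([0.0, 1.0, 0.0])}  # model trained on the new pair
lang_ctrl = {"w": np.array([0.0, 0.0, 0.5])}  # language control model

tvs = [task_vector(m, base) for m in (st_old, st_new, lang_ctrl)]
merged = merge(base, tvs, scale=1.0)
print(merged["w"])  # [1.  1.  0.5]

# Task analogy (hypothetical shapes): synthesize an ST task vector for a new
# pair from MT systems, ST_new ≈ ST_old + (MT_new - MT_old).
mt_old = {"w": np.array([0.2, 0.0, 0.0])}
mt_new = {"w": np.array([0.0, 0.2, 0.0])}
st_synth = {k: st_old[k] + mt_new[k] - mt_old[k] for k in base}
print(st_synth["w"])  # [0.8 0.2 0. ]
```

The `scale` coefficient controls how strongly each task vector is applied; in practice such coefficients are tuned per merged model on a validation set.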