In end-to-end speech translation, pre-trained speech and text models improve translation quality. Recently proposed models simply connect a pre-trained speech model and a pre-trained text model as encoder and decoder, so only the output of the encoder's final layer is passed to the decoder. Because each layer of a pre-trained speech model encodes different information, this simple connection cannot fully exploit what the speech model has learned. In this study, we propose an inter-connection mechanism that aggregates the outputs of all layers of the pre-trained speech model by a weighted sum and feeds the result to the decoder. With the pre-trained speech model frozen, this mechanism increased BLEU by approximately 2 points on en-de, en-ja, and en-zh while adding only 2K parameters. Furthermore, we investigated the contribution of each layer for each language by visualizing the layer weights and found that the contributions differ across languages.
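The layer aggregation described in the abstract can be illustrated with a minimal sketch. It assumes one learnable scalar per encoder layer, normalized with a softmax so the weights sum to 1. The abstract states only that a weighted sum is used, so the softmax normalization and the names `softmax`, `aggregate_layers`, `layer_outputs`, and `layer_logits` are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List

# One layer's output as a [time][feature] matrix (hypothetical layout).
Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Turn unconstrained scalars into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs: List[Matrix],
                     layer_logits: List[float]) -> Matrix:
    """Fuse all encoder layers into a single sequence by a weighted sum.

    layer_outputs: one [time][feature] matrix per encoder layer.
    layer_logits:  one scalar per layer (assumed learnable; softmax-normalized
                   here as an assumption, not a detail taken from the paper).
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    n_frames = len(layer_outputs[0])
    n_features = len(layer_outputs[0][0])
    fused = [[0.0] * n_features for _ in range(n_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(n_frames):
            for d in range(n_features):
                fused[t][d] += weight * layer[t][d]
    return fused


# Toy example: two layers, one frame, two features.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = aggregate_layers(toy_layers, [0.0, 0.0])  # equal logits: plain average
```

In this sketch the frozen encoder contributes no trainable parameters. Only the per-layer scalars would be learned, and inspecting the normalized weights after training is one way to compare how much each layer contributes for a given language pair.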