Multi-task learning has brought significant improvements to end-to-end speech translation (ST). However, how consistent the auxiliary tasks are with the ST task, and how much this approach truly helps, have not been thoroughly studied. In this paper, we investigate the consistency between different tasks across different times and modules. We find that the textual encoder primarily facilitates cross-modal conversion, but that noise in speech impedes the consistency between text and speech representations. Based on these findings, we propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the differences in length and representation between the two modalities. Experiments on the MuST-C dataset demonstrate that our method attains state-of-the-art (SOTA) results. Moreover, when additional data is used, we achieve a new SOTA result on the MuST-C English-to-Spanish task with only 20.8% of the training time required by the current SOTA method.