In this paper, we apply Singular Value Canonical Correlation Analysis (SVCCA) to the representations learned by a multilingual end-to-end speech translation model trained on 22 languages. SVCCA lets us estimate representational similarity across languages and layers, which improves our understanding of how multilingual speech translation works and how it may relate to multilingual neural machine translation. The model is trained on the CoVoST 2 dataset in all possible directions, and we use LASER to extract parallel bitext data for the SVCCA analysis. Our analysis yields three major findings: (I) Linguistic similarity loses its efficacy in multilingual speech translation when the training data for a specific language is limited. (II) Enhanced encoder representations and well-aligned audio-text data significantly improve translation quality, and the multilingual model surpasses its bilingual counterparts when the training data is not compromised. (III) The encoder representations of the multilingual speech translation model demonstrate superior performance on phonetic features in linguistic typology prediction. Based on these findings, we propose that lifting the limited-data constraint for low-resource languages and then combining them with linguistically related high-resource languages could offer a more effective approach to multilingual end-to-end speech translation.
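As a rough illustration of the similarity measure named above, the following is a minimal NumPy sketch of the generic SVCCA procedure, not the pipeline used in this work. It assumes that each input matrix holds one row per parallel example (for instance, a pooled encoder state per sentence, which is an assumption here) and one column per hidden unit. The names `svd_basis` and `svcca_similarity` and the `keep_variance` threshold are hypothetical choices made for this example.

```python
import numpy as np


def svd_basis(acts, keep_variance=0.99):
    """Return an orthonormal basis for the top singular directions of `acts`.

    `acts` has shape (n_examples, n_features). The data are mean-centered,
    and enough directions are kept to explain `keep_variance` of the variance.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # First index at which the cumulative variance reaches the threshold.
    k = int(np.searchsorted(explained, keep_variance)) + 1
    k = min(k, len(s))
    # Columns of u are orthonormal and span the retained subspace.
    return u[:, :k]


def svcca_similarity(acts_a, acts_b, keep_variance=0.99):
    """Mean canonical correlation between two sets of activations.

    Both inputs must describe the same examples in the same row order.
    """
    basis_a = svd_basis(acts_a, keep_variance)
    basis_b = svd_basis(acts_b, keep_variance)
    # The singular values of basis_a^T basis_b are the canonical
    # correlations between the two retained subspaces.
    rho = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer_a = rng.normal(size=(500, 8))           # placeholder activations
    layer_b = layer_a @ rng.normal(size=(8, 8))   # linear re-mixing of layer_a
    print(svcca_similarity(layer_a, layer_b, keep_variance=1.0))
```

Because canonical correlations are invariant to invertible linear transformations, two representations that differ only by such a re-mixing score close to 1, while unrelated representations score close to 0 when the number of examples is large relative to the number of retained directions.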