In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as feature extractors for a set of typologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations using deep generalized canonical correlation analysis. Results show that the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. PSR scores correlate positively with ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.