Many self-supervised speech models (S3Ms) have been introduced over the last few years, producing performance and data efficiency improvements for a variety of speech tasks. Evidence is emerging that different S3Ms encode linguistic information in different layers, and also that some S3Ms appear to learn phone-like sub-word units. However, the extent to which these models capture larger linguistic units, such as words, and where word-related information is encoded, remains unclear. In this study, we conduct several analyses of word segment representations extracted from different layers of three S3Ms: wav2vec2, HuBERT, and WavLM. We employ canonical correlation analysis (CCA), a lightweight analysis tool, to measure the similarity between these representations and word-level linguistic properties. We find that the maximal word-level linguistic content tends to be found in intermediate model layers, while some lower-level information like pronunciation is also retained in higher layers of HuBERT and WavLM. Syntactic and semantic word attributes have similar layer-wise behavior. We also find that, for all of the models tested, word identity information is concentrated near the center of each word segment. We then test the layer-wise performance of the same models, when used directly with no additional learned parameters, on several tasks: acoustic word discrimination, word segmentation, and semantic sentence similarity. We find similar layer-wise trends in performance, and furthermore, find that when using the best-performing layer of HuBERT or WavLM, it is possible to achieve performance on word segmentation and sentence similarity that rivals more complex existing approaches.
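As a rough illustration of the kind of CCA-based similarity score described above, the following minimal sketch computes the mean canonical correlation between two views of the same set of word segments. It assumes plain, lightly regularized CCA, which may differ from the exact variant used in the study. The function name `mean_cca_similarity`, the `eps` regularizer, and the synthetic data are all hypothetical, introduced here for illustration only.

```python
import numpy as np


def mean_cca_similarity(X, Y, eps=1e-6):
    """Mean canonical correlation between two views of the same n items.

    X: (n, d1) array, e.g. one pooled vector per word segment from one layer.
    Y: (n, d2) array, e.g. a vector of word-level linguistic attributes.
    eps: small ridge term (an assumption of this sketch) for numerical stability.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]

    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Regularized covariance and cross-covariance matrices.
    Sxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Whiten both views using Cholesky factors: T = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T

    # Singular values of T are the canonical correlations.
    corrs = np.linalg.svd(T, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((2000, 8))        # stand-in for layer features
    related = feats[:, :4] @ rng.standard_normal((4, 4))  # linearly related view
    unrelated = rng.standard_normal((2000, 4))    # independent view
    print("related:", mean_cca_similarity(feats, related))
    print("unrelated:", mean_cca_similarity(feats, unrelated))
```

In a layer-wise analysis, a score like this would be computed once per layer against a fixed attribute matrix, and the resulting curve compared across layers and models.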