What do self-supervised speech models know about words?

Many self-supervised speech models (S3Ms) have been introduced over the last few years, producing performance and data efficiency improvements for a variety of speech tasks. Evidence is emerging that different S3Ms encode linguistic information in different layers, and also that some S3Ms appear to learn phone-like sub-word units. However, the extent to which these models capture larger linguistic units, such as words, and where word-related information is encoded, remains unclear. In this study, we conduct several analyses of word segment representations extracted from different layers of three S3Ms: wav2vec2, HuBERT, and WavLM. We employ canonical correlation analysis (CCA), a lightweight analysis tool, to measure the similarity between these representations and word-level linguistic properties. We find that the maximal word-level linguistic content tends to be found in intermediate model layers, while some lower-level information like pronunciation is also retained in higher layers of HuBERT and WavLM. Syntactic and semantic word attributes have similar layer-wise behavior. We also find that, for all of the models tested, word identity information is concentrated near the center of each word segment. We then test the layer-wise performance of the same models, when used directly with no additional learned parameters, on several tasks: acoustic word discrimination, word segmentation, and semantic sentence similarity. We find similar layer-wise trends in performance, and furthermore, find that when using the best-performing layer of HuBERT or WavLM, it is possible to achieve performance on word segmentation and sentence similarity that rivals more complex existing approaches.
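As a rough illustration of the analysis described above, the following minimal sketch mean-pools frame-level features over word segments and computes a plain CCA similarity score between the pooled representations and a matrix of word-level attributes. This is not the authors' implementation: the study may use a different CCA variant and a different pooling scheme, and the function names `mean_pool_segments` and `cca_similarity` are hypothetical.

```python
import numpy as np


def mean_pool_segments(frames, boundaries):
    """Average frame-level features over each (start, end) word segment.

    frames: array of shape (num_frames, dim), e.g. one layer's output.
    boundaries: list of (start, end) frame indices, end exclusive.
    Mean pooling is an assumption made for this sketch.
    """
    return np.stack([frames[s:e].mean(axis=0) for s, e in boundaries])


def _orthonormal_basis(M, tol=1e-10):
    """Orthonormal basis for the column space of the mean-centered matrix."""
    Mc = M - M.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    # Drop directions with negligible variance so they cannot inflate the score.
    return U[:, s > tol * s.max()]


def cca_similarity(X, Y):
    """Mean canonical correlation between two views of the same samples.

    X: (num_words, dim_x) word segment representations.
    Y: (num_words, dim_y) word-level attribute vectors.
    The canonical correlations are the singular values of Qx^T Qy,
    where Qx and Qy are orthonormal bases of the centered views.
    """
    Qx = _orthonormal_basis(X)
    Qy = _orthonormal_basis(Y)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())
```

Under these assumptions, a layer-wise analysis would call `cca_similarity` once per layer, with `X` built from that layer's pooled word segments and `Y` held fixed, and then compare the scores across layers. The score is only meaningful when the number of word samples is well above the feature dimensions.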
