Self-supervised learning (SSL) models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models remain opaque. In this paper, we analyze the contextual representations encoded by these foundation models in terms of their inter- and intra-model similarity, independent of any external annotation or task-specific constraint. We examine SSL models that differ in training paradigm, contrastive (Wav2Vec2.0) versus predictive (HuBERT), and in model size (base and large). We explore these models at different levels of localization/distributivity of information: (i) individual neurons, (ii) layer representations, and (iii) attention weights; we also (iv) compare the representations with those of their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts\footnote{A concept represents a coherent fragment of knowledge, such as ``a class containing certain objects as elements, where the objects have certain properties''.}. To facilitate further research, we have publicly released our code.