Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks -- word discrimination, word segmentation, and semantic sentence similarity -- S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.