Self-supervised speech representations are now widely used across many applications. Recent work has begun to analyze what type of information each of these representations contains. Most such work uses downstream models to test whether the representations can be used successfully for a specific task. Downstream models, however, typically apply nonlinear operations to the representation, extracting information that may not have been readily available in the original representation. In this work, we analyze the spatial organization of phone and speaker information in several state-of-the-art speech representations using methods that do not require a downstream model. Using representation similarity analysis, we measure how different layers encode basic acoustic parameters such as formants and pitch. Further, we use non-parametric statistical testing to study the extent to which each representation clusters speech samples by phone or speaker class. Our results indicate that models represent these speech attributes differently depending on the target task used during pretraining.
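The idea behind representation similarity analysis can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of Euclidean distance for the dissimilarity matrices, the rank (Spearman-style) correlation, the helper names `dissimilarity_matrix`, `ordinal_ranks`, and `rsa_score`, and the synthetic data standing in for layer activations and acoustic parameters are all assumptions made for illustration.

```python
import numpy as np

def dissimilarity_matrix(x):
    """Pairwise Euclidean distances between the rows (samples) of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def ordinal_ranks(v):
    """Ordinal ranks of a 1-D array (ties broken by position; assumed rare)."""
    return np.argsort(np.argsort(v)).astype(float)

def rsa_score(reps, feats):
    """Rank correlation between the dissimilarity structure of two views.

    reps:  (n_samples, d1) array, e.g. activations of one model layer
    feats: (n_samples, d2) array, e.g. formants and pitch per sample
    """
    # Use only the upper triangle: the matrices are symmetric with a zero diagonal.
    i, j = np.triu_indices(reps.shape[0], k=1)
    a = ordinal_ranks(dissimilarity_matrix(reps)[i, j])
    b = ordinal_ranks(dissimilarity_matrix(feats)[i, j])
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins (hypothetical data, not real speech features).
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 3))           # e.g. F1, F2, pitch
layer_a = feats @ rng.normal(size=(3, 16)) + 0.1 * rng.normal(size=(50, 16))
layer_b = rng.normal(size=(50, 16))        # unrelated to feats

print("layer_a vs acoustic features:", rsa_score(layer_a, feats))
print("layer_b vs acoustic features:", rsa_score(layer_b, feats))
```

Because the score compares only pairwise dissimilarities, no classifier or regressor is trained on the representation, which is what lets this kind of analysis avoid a downstream model.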