Self-supervised learning (SSL) models have become essential in speech processing, with recent advances concentrating on architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations aligned with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated improvements over single-scale models, the precise reasons for these gains remain poorly supported by empirical evidence. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, using Canonical Correlation Analysis (CCA) and Mutual Information (MI). Applying this analysis to Multi-Resolution HuBERT (MR-HuBERT), we find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and underscore the importance of disentangling computational efficiency from the learning of better representations.
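As a rough illustration of the kind of layer-wise CCA analysis described above, the sketch below scores each layer's representations against external word-level features and reports a mean canonical correlation per layer. The function names, the per-word mean pooling, and the use of scikit-learn's linear CCA are our own assumptions for illustration; the paper's exact CCA variant and feature alignment may differ.

```python
# Minimal sketch of a layer-wise CCA similarity analysis (assumed setup,
# not the paper's exact implementation).
import numpy as np
from sklearn.cross_decomposition import CCA


def cca_similarity(layer_reprs: np.ndarray,
                   word_feats: np.ndarray,
                   n_components: int = 10) -> float:
    """Mean canonical correlation between one layer's representations
    (N x D1) and word-level features (N x D2), aligned on the same N items.

    Note: n_components must not exceed min(D1, D2, N).
    """
    cca = CCA(n_components=n_components, max_iter=1000)
    # fit_transform returns the canonical variates for both views.
    x_c, y_c = cca.fit_transform(layer_reprs, word_feats)
    # Average the correlation of each pair of canonical variates.
    corrs = [np.corrcoef(x_c[:, i], y_c[:, i])[0, 1]
             for i in range(n_components)]
    return float(np.mean(corrs))


# Hypothetical usage: score every layer of a frozen model against word
# features, where reprs_per_layer is a list of (N, D) arrays obtained by
# mean-pooling frame-level hidden states within each word span.
# scores = [cca_similarity(h, word_feats) for h in reprs_per_layer]
```

Tracking how these scores change across layers (and across resolutions in MR-HuBERT) is one way to test whether lower-resolution blocks actually align better with word-level information.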