Self-supervised speech models (S3Ms) have become a common tool in the speech processing community, as their representations can be leveraged for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations of speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, despite the high dimensionality and redundancy of S3M representations, preprocessing them for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as the orthogonality and interpretability of individual ICA components.
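The pipeline described above (linear preprocessing followed by k-means) can be sketched with scikit-learn. This is a minimal illustration, not the paper's implementation: the random matrix stands in for frame-level S3M features (e.g., hidden states of a pretrained model), and the component and cluster counts are arbitrary placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for frame-level S3M representations; in practice these would be
# extracted from a self-supervised speech model.
features = rng.normal(size=(2000, 64))

# Linear preprocessing before clustering: standardize each dimension,
# then apply ICA (PCA or whitening could be substituted here).
scaled = StandardScaler().fit_transform(features)
ica = FastICA(n_components=32, random_state=0, max_iter=1000)
transformed = ica.fit_transform(scaled)

# k-means on the preprocessed features: each frame is mapped to its
# cluster index, giving one discrete speech unit (DSU) per frame.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
dsu = kmeans.fit_predict(transformed)
print(dsu.shape)  # one cluster index per input frame
```

Swapping `FastICA` for `sklearn.decomposition.PCA` (optionally with `whiten=True`) covers the other preprocessing variants evaluated in the paper.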