This work presents an in-depth analysis of discrete self-supervised speech representations (units) through the lens of Generative Spoken Language Modeling (GSLM). Based on the findings of this analysis, we propose practical improvements to the discrete units used for GSLM. First, we study these units along three axes: interpretation, visualization, and resynthesis. Our analysis finds a high correlation between the speech units and both phonemes and phoneme families, while their correlation with speaker or gender is weaker. Additionally, we find redundancies in the extracted units and argue that one reason may be the units' context. Following this analysis, we propose a new, unsupervised metric to measure unit redundancies. Finally, we use this metric to develop new methods that improve the robustness of the units' clustering and show significant improvement on zero-resource speech metrics such as ABX. Code and analysis tools are available at the following link: https://github.com/slp-rl/SLM-Discrete-Representations