Self-supervised speech representations can greatly benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream task performance: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and the phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone identification to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
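To make the two geometric properties concrete, the following is a minimal sketch using standard stand-in quantities, not the CRV measure itself, whose definition is not given in this abstract. It assumes frame-level representations in a NumPy array with one speaker label and one phone label per frame. The function names (`centroid_basis`, `subspace_overlap`, `participation_ratio`) and the toy data are hypothetical and chosen only for illustration: subspace overlap is measured through the principal angles between the two centroid subspaces, and isotropy is approximated by the participation ratio of the covariance eigenvalues.

```python
import numpy as np

def centroid_basis(X, labels):
    """Orthonormal basis for the span of the mean-centred class centroids."""
    classes = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    centroids = centroids - centroids.mean(axis=0)
    # Rows of vt span the centroid subspace; drop numerically null directions.
    _, s, vt = np.linalg.svd(centroids, full_matrices=False)
    keep = s > 1e-10 * s.max()
    return vt[keep].T  # shape: (dim, rank)

def subspace_overlap(B1, B2):
    """Mean squared cosine of the principal angles between two subspaces.

    Close to 0 when the subspaces are orthogonal, 1 when they coincide.
    """
    cosines = np.linalg.svd(B1.T @ B2, compute_uv=False)
    return float(np.mean(cosines ** 2))

def participation_ratio(X):
    """Effective dimensionality (sum of eigenvalues)^2 / sum of squared eigenvalues.

    Ranges from 1 (one dominant direction) up to the number of dimensions
    (variance spread evenly), so it serves as a rough isotropy proxy.
    """
    Xc = X - X.mean(axis=0)
    eig = np.linalg.svd(Xc, compute_uv=False) ** 2  # proportional to covariance eigenvalues
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# Toy data (an assumption for illustration): 3 speakers x 4 phones in 8 dimensions.
rng = np.random.default_rng(0)
dim, n_spk, n_ph = 8, 3, 4
spk = np.repeat(np.arange(n_spk), n_ph)  # speaker label per frame
ph = np.tile(np.arange(n_ph), n_spk)     # phone label per frame

S = np.zeros((n_spk, dim))
S[:, :2] = rng.standard_normal((n_spk, 2))        # speaker offsets live in dims 0-1
P_orth = np.zeros((n_ph, dim))
P_orth[:, 2:4] = rng.standard_normal((n_ph, 2))   # phone offsets in dims 2-3
P_same = np.zeros((n_ph, dim))
P_same[:, :2] = rng.standard_normal((n_ph, 2))    # phone offsets share dims 0-1

X_orth = S[spk] + P_orth[ph]
X_same = S[spk] + P_same[ph]

overlap_orth = subspace_overlap(centroid_basis(X_orth, spk), centroid_basis(X_orth, ph))
overlap_same = subspace_overlap(centroid_basis(X_same, spk), centroid_basis(X_same, ph))

X_iso = rng.standard_normal((5000, dim))                 # roughly isotropic cloud
X_aniso = X_iso * np.array([10.0] + [0.1] * (dim - 1))   # one dominant direction
pr_iso = participation_ratio(X_iso)
pr_aniso = participation_ratio(X_aniso)
```

On this toy data the overlap is near 0 when speaker and phone offsets occupy disjoint dimensions and near 1 when they share the same two dimensions, while the participation ratio approaches the full dimensionality for the isotropic cloud and drops toward 1 for the anisotropic one. These quantities are only proxies for the properties named above and should not be read as the measure introduced in this work.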