Learning phone types from phone instances is a long-standing problem that remains open. In this work, we revisit the problem in the context of self-supervised learning and pose it as one of matching cluster centroids to phone embeddings. We study two key properties that enable matching: whether cluster centroids of self-supervised representations reduce the variability of phone instances, and whether they respect the relationships among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationships among phones. Training with the new loss jointly with standard self-supervised losses, such as autoregressive predictive coding (APC) and contrastive predictive coding (CPC), significantly improves downstream phone classification.
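To make the idea of matching cluster centroids to phone embeddings concrete, the following is a minimal sketch and not the method of this work. It assumes cosine similarity as the matching criterion and a many-to-one assignment in which each centroid is mapped to its most similar phone embedding. The function names, the toy vectors, and the phone symbols are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_centroids_to_phones(centroids, phone_embeddings):
    """Map each centroid to the phone with the most similar embedding.

    Assumption: a many-to-one, similarity-based matching; the actual
    matching procedure is not specified in the abstract.
    """
    return [
        max(phone_embeddings, key=lambda p: cosine(c, phone_embeddings[p]))
        for c in centroids
    ]


def pseudo_labels(frames, centroids, centroid_to_phone):
    """Label each frame with the phone matched to its nearest centroid."""
    labels = []
    for x in frames:
        k = max(range(len(centroids)), key=lambda i: cosine(x, centroids[i]))
        labels.append(centroid_to_phone[k])
    return labels


# Toy data (hypothetical): three centroids, two phones, two frames.
centroids = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.3]]
phone_embeddings = {"aa": [1.0, 0.0], "s": [0.0, 1.0]}
frames = [[0.8, 0.0], [0.2, 0.9]]

mapping = match_centroids_to_phones(centroids, phone_embeddings)
labels = pseudo_labels(frames, centroids, mapping)
print(mapping)  # ['aa', 's', 'aa']
print(labels)   # ['aa', 's']
```

In this sketch, several centroids may map to the same phone, which is one way cluster centroids could absorb the variability of phone instances. The resulting frame-level labels could then serve as targets for an auxiliary classification loss trained alongside a self-supervised objective.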