Iterative self-training, or iterative pseudo-labeling (IPL), in which the improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach for enhancing the quality of speaker representations. Recent applications of IPL to unsupervised speaker recognition start from representations extracted by elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. We show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model such as i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
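The IPL loop described above (extract embeddings with the current model, cluster them into pseudo-labels, retrain on those labels, repeat) can be sketched as follows. This is a minimal toy illustration, not the paper's pipeline: the Gaussian "utterances", the LDA projection standing in for an encoder, and the choice of k-means with a fixed cluster count are all assumptions made for the example.

```python
# Toy sketch of iterative pseudo-labeling (IPL). A real system would start
# from i-vectors and retrain a neural speaker encoder at each iteration;
# here LDA on pseudo-labels plays the role of the "retrained" model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Toy data: 3 "speakers", 50 utterances each, 20-dim features.
X = np.concatenate([rng.normal(loc=c, scale=1.0, size=(50, 20))
                    for c in (0.0, 4.0, 8.0)])

def extract_embeddings(X, labels=None):
    """Stand-in for the encoder: with no labels (the weak bootstrap model),
    return raw features; otherwise project with LDA fit on pseudo-labels."""
    if labels is None:
        return X
    lda = LinearDiscriminantAnalysis(n_components=2)
    return lda.fit(X, labels).transform(X)

n_clusters = 3                           # assumed speaker count: a key IPL knob
emb = extract_embeddings(X)              # iteration 0: weak initial model
for it in range(3):                      # IPL: cluster -> pseudo-label -> retrain
    pseudo = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(emb)
    emb = extract_embeddings(X, pseudo)  # "retrain" encoder on pseudo-labels

print(len(set(pseudo)))  # number of pseudo-speaker clusters found
```

The study's ablations map directly onto this skeleton: swapping `extract_embeddings` for the initial bootstrap (i-vector vs. DINO), varying `n_clusters`, or replacing `KMeans` with another clustering algorithm.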