Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven a powerful approach for enhancing the quality of speaker representations. Recent applications of IPL to unsupervised speaker recognition start from representations extracted by elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require careful hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be necessary at all. We show that the simple, well-studied, and established i-vector generative model suffices to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of the other components of the IPL process, including the initial model, the encoder, the augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model such as i-vectors, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
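The IPL loop described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's actual pipeline: the initial embeddings stand in for i-vectors, a plain k-means produces the pseudo-labels, and "retraining" is replaced by a stand-in step (re-embedding each utterance by its negated squared distances to the pseudo-class means). The function names `kmeans` and `ipl` are illustrative, not from the paper.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; returns a pseudo-label per row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():  # keep old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def ipl(features, k, rounds=3):
    """Toy IPL loop: cluster current embeddings into k pseudo-speakers,
    then 're-embed' each utterance as its negated squared distances to
    the pseudo-class means (a stand-in for retraining an encoder)."""
    emb = np.asarray(features, dtype=float)  # bootstrap embeddings (i-vectors in the paper)
    for _ in range(rounds):
        labels = kmeans(emb, k)              # pseudo-labels for this iteration
        means = np.stack([emb[labels == j].mean(axis=0) if (labels == j).any()
                          else emb.mean(axis=0) for j in range(k)])
        emb = -((emb[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return labels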