Speaker verification at large scale remains an open challenge because fixed-margin losses treat all training samples equally, regardless of their quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores, derived from the cosine similarity to the dominant sub-center, rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable tier weights guide the model from stable identity foundations, through manifold refinement, to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. On VoxCeleb1-O and SITW, Curry reduces EER by 86.8\% and 60.0\%, respectively, relative to the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.
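The confidence-based tiering described above can be illustrated with a minimal NumPy sketch. Everything beyond what the abstract states is an assumption, not Curry's published configuration. This includes the tier thresholds (an EMA of the batch mean plus or minus `alpha` times the EMA of the batch standard deviation of target-class confidence), the EMA `momentum`, the ArcFace scale `s` and margin `m`, and the softmax parameterization of the three tier weights. All function and class names are hypothetical. The tier logits are held fixed here; in training they would be parameters updated by the optimizer.

```python
import numpy as np


def l2_normalize(a, axis=-1, eps=1e-12):
    return a / (np.linalg.norm(a, axis=axis, keepdims=True) + eps)


def subcenter_cosines(emb, centers):
    """Cosine to the dominant (closest) sub-center of each class.

    emb: (B, D) embeddings; centers: (C, K, D) with K sub-centers per class.
    Returns (B, C), the max over the K sub-centers.
    """
    cos = np.einsum("bd,ckd->bck", l2_normalize(emb), l2_normalize(centers))
    return cos.max(axis=2)


def arcface_per_sample_loss(cos, labels, s=30.0, m=0.2):
    """Additive angular margin softmax loss, one value per sample (no easy-margin handling)."""
    idx = np.arange(len(labels))
    theta = np.arccos(np.clip(cos[idx, labels], -1.0, 1.0))
    logits = s * cos
    logits[idx, labels] = s * np.cos(theta + m)  # margin on the target class only
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[idx, labels] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob


class CurriculumTiering:
    """Hypothetical tiering: running (EMA) mean/std of confidence set the thresholds."""

    def __init__(self, momentum=0.9, alpha=0.5):
        self.momentum = momentum
        self.alpha = alpha
        self.mu = None
        self.sigma = None

    def update_and_assign(self, conf):
        b_mu, b_sigma = conf.mean(), conf.std()
        if self.mu is None:  # first batch initializes the running statistics
            self.mu, self.sigma = b_mu, b_sigma
        else:
            self.mu = self.momentum * self.mu + (1 - self.momentum) * b_mu
            self.sigma = self.momentum * self.sigma + (1 - self.momentum) * b_sigma
        tiers = np.ones(conf.shape, dtype=int)  # 1 = medium
        tiers[conf >= self.mu + self.alpha * self.sigma] = 0  # 0 = easy
        tiers[conf < self.mu - self.alpha * self.sigma] = 2  # 2 = hard
        return tiers


def curry_style_loss(emb, labels, centers, tiering, tier_logits):
    """Weight each sample's ArcFace loss by a softmax weight for its tier."""
    cos = subcenter_cosines(emb, centers)
    conf = cos[np.arange(len(labels)), labels]  # dominant sub-center, target class
    tiers = tiering.update_and_assign(conf)
    w = np.exp(tier_logits - tier_logits.max())
    w /= w.sum()  # softmax over (easy, medium, hard)
    per_sample = arcface_per_sample_loss(cos, labels)
    return float((w[tiers] * per_sample).mean()), tiers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(10, 3, 16))  # 10 speakers, 3 sub-centers, 16-dim
    labels = rng.integers(0, 10, size=32)
    emb = centers[labels, 0] + 0.5 * rng.normal(size=(32, 16))
    loss, tiers = curry_style_loss(emb, labels, centers, CurriculumTiering(), np.zeros(3))
    print(round(loss, 3), np.bincount(tiers, minlength=3))
```

Taking the maximum over sub-centers means a sample that is noisy for its labelled identity still falls into the low-confidence tier, even if it is close to some non-dominant sub-center. Shifting the tier logits during training would move the emphasis from easy toward hard samples, in the spirit of the curriculum the abstract describes.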