Modern speaker verification systems primarily rely on speaker embeddings, with verification based on the cosine similarity between the embedding vectors of the enrollment and test utterances. While effective, these methods struggle with multi-talker speech due to the unidentifiability of embedding vectors. In this paper, we propose Neural Scoring (NS), a new end-to-end framework that directly estimates verification posterior probabilities without relying on test-side embeddings, making it more robust to complex conditions such as multi-talker speech. To make the training of such an end-to-end model more efficient, we introduce a large-scale trial end-to-end training (LtE2E) strategy, in which each test utterance is paired with a set of enrolled speakers, so that a large number of verification trials are processed in each batch. Experiments on the VoxCeleb dataset demonstrate that NS consistently outperforms both the baseline and competitive methods across various conditions, achieving an overall 70.36% reduction in equal error rate (EER) compared to the baseline.
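To make the trial layout concrete, the following is a minimal sketch, not the NS model itself. It scores every test utterance against every enrolled speaker with the cosine baseline and builds the matching target matrix for a batch. The speaker names, toy embeddings, helper names (`cosine`, `trial_matrix`, `binary_cross_entropy`), the sigmoid stand-in for the posterior, and the choice of binary cross-entropy as the loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (the baseline score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def trial_matrix(test_speakers, enrolled_ids):
    """Targets for one batch: entry (i, k) is 1.0 if enrolled speaker k
    is present in test utterance i, else 0.0 (assumed labeling scheme)."""
    return [[1.0 if spk in present else 0.0 for spk in enrolled_ids]
            for present in test_speakers]

def binary_cross_entropy(posteriors, labels, eps=1e-7):
    """Mean binary cross-entropy over all trials in the batch (assumed loss)."""
    total, count = 0.0, 0
    for p_row, y_row in zip(posteriors, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            count += 1
    return total / count

# Hypothetical enrolled speakers with toy 3-dimensional embeddings.
enrolled = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.0, 1.0, 0.0],
    "spk_c": [0.0, 0.0, 1.0],
}
enrolled_ids = list(enrolled)

# Toy test-side embeddings; the second one mimics a two-talker mixture.
test_embeddings = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
test_speakers = [{"spk_a"}, {"spk_a", "spk_b"}, {"spk_c"}]

# Every test utterance is paired with every enrolled speaker,
# so a batch of 3 utterances yields 3 x 3 = 9 verification trials.
scores = [[cosine(t, enrolled[s]) for s in enrolled_ids] for t in test_embeddings]
labels = trial_matrix(test_speakers, enrolled_ids)

# Stand-in for a model's posterior output: a sigmoid over the cosine scores.
posteriors = [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]
loss = binary_cross_entropy(posteriors, labels)
```

In this toy batch the mixture embedding is equally similar to `spk_a` and `spk_b`, which hints at why a single test-side embedding is a weak representation of multi-talker speech. An end-to-end model would replace the sigmoid stand-in with posteriors estimated directly from the test utterance.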