This paper addresses speaker verification on audio containing multiple overlapping speakers. Most speaker verification systems are designed under the assumption that only a single speaker is present in a given audio segment. However, in real-world settings this assumption does not always hold. We demonstrate that current speaker verification systems are not robust against audio with noticeable speaker overlap. To alleviate this issue, we propose margin-mixup, a simple training strategy that can easily be adopted by existing speaker verification pipelines to make the resulting speaker embeddings robust against multi-speaker audio. In contrast to other methods, margin-mixup requires no alterations to regular speaker verification architectures, while attaining better results. On our multi-speaker test set based on VoxCeleb1, the proposed margin-mixup strategy yields an average relative EER improvement of 44.4% over our state-of-the-art speaker verification baseline systems.