Speaker verification judges the similarity between two unknown voices in an open-set setting, where an ideal speaker embedding condenses discriminative information into a compact utterance-level representation with small intra-speaker distances and large inter-speaker distances. We propose the Voice Transformer (VOT), a novel model for speaker verification that integrates parallel transformers at multiple scales. A deep feedforward sequential memory network (DFSMN) is incorporated into the attention part of these transformers to increase feature granularity, and an attentive statistics pooling layer is added to focus on important frames and form utterance-level features. We further propose an Additive Angular Margin Focal Loss (AAMF) to address the hard-sample problem. We evaluate the proposed approach on the VoxCeleb1 and CN-Celeb2 datasets, demonstrating that VOT surpasses most mainstream models. The code is available on GitHub\footnote{\url{https://github.com/luckyerr/Voice-Transformer_Speaker-Verification}}.
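The attentive statistics pooling mentioned above can be illustrated with a minimal sketch. This is a generic pure-Python rendering of attentive statistics pooling in the style of Okabe et al. (softmax attention weights over frames, then a weighted mean and weighted standard deviation per dimension), not the paper's exact layer; the function name and interface are assumptions for illustration.

```python
import math

def attentive_stats_pooling(frames, scores):
    """Sketch of attentive statistics pooling: softmax the frame-level
    attention scores into weights, then compute the weighted mean and
    weighted standard deviation of the frame features per dimension.

    frames: list of T frame-level feature vectors (each of dimension D)
    scores: list of T scalar attention scores, one per frame
    """
    # numerically stable softmax over the attention scores
    mx = max(scores)
    exps = [math.exp(a - mx) for a in scores]
    total = sum(exps)
    w = [e / total for e in exps]  # attention weights, sum to 1

    dim = len(frames[0])
    # weighted first moment (mean) per dimension
    mean = [sum(w[t] * frames[t][d] for t in range(len(frames)))
            for d in range(dim)]
    # weighted second moment minus squared mean gives the variance
    var = [sum(w[t] * frames[t][d] ** 2 for t in range(len(frames))) - mean[d] ** 2
           for d in range(dim)]
    std = [math.sqrt(max(v, 0.0)) for v in var]

    # utterance-level representation: concatenation of mean and std
    return mean + std
```

With uniform attention scores this reduces to an ordinary statistics pooling (plain mean and standard deviation over frames); non-uniform scores let the layer emphasize speaker-discriminative frames.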
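The AAMF loss presumably combines the additive angular margin of AAM-softmax (ArcFace) with the focal-loss weighting that down-weights easy samples; the sketch below shows that combination for a single sample under those assumptions, with hyperparameter values (`s`, `m`, `gamma`) chosen for illustration rather than taken from the paper.

```python
import math

def aam_focal_loss(cosines, target, s=30.0, m=0.2, gamma=2.0):
    """Sketch of an additive angular margin focal loss for one sample.

    cosines: cosine similarities between the embedding and each class center
    target:  index of the ground-truth speaker class
    s, m:    AAM-softmax scale and additive angular margin
    gamma:   focal-loss focusing parameter (down-weights easy samples)
    """
    logits = []
    for i, c in enumerate(cosines):
        if i == target:
            # add the angular margin m to the target angle before scaling
            theta = math.acos(max(-1.0, min(1.0, c)))
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * c)

    # numerically stable softmax probability of the target class
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    p_t = exps[target] / sum(exps)

    # focal weighting: (1 - p_t)^gamma shrinks the loss of easy samples
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

An easy sample (high target cosine) yields a much smaller loss than a hard one, so training gradients concentrate on the hard samples the abstract refers to.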