This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNNs) that use Multi-head Self-Attention (MSA) mechanisms and memory layers. First, we propose the use of a learnable vector, called the class token, to replace the global average pooling mechanism for extracting the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two further approaches. The first is a Bayesian estimation of the class token. The second is a distilled representation token, combined with the class token, for training a teacher-student pair of networks following the Knowledge Distillation (KD) philosophy. This distillation token is trained to mimic the predictions of the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.
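The token mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention head instead of full MSA, omits memory layers, positional information, the Bayesian estimation and training, and uses hypothetical sizes (`T`, `D`, `C`) and randomly initialised weights. The equally weighted, hard-label distillation loss is also an assumption, since the abstract only states that the distillation token mimics the teacher while the class token follows the true label.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, D) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    return -np.log(softmax(logits)[target])

# Hypothetical sizes: T input frames, D features per frame, C classes.
T, D, C = 50, 16, 4
frames = rng.normal(size=(T, D))

# Learnable parameters (random here; a real system would train them).
class_token = rng.normal(size=(1, D))
distill_token = rng.normal(size=(1, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
w_class = rng.normal(size=(D, C))
w_distill = rng.normal(size=(D, C))

# Concatenate both tokens to the input before the attention layer.
tokens = np.concatenate([class_token, distill_token, frames], axis=0)
out = self_attention(tokens, w_q, w_k, w_v)  # shape (T + 2, D)

# The output states of the tokens take the place of the pooled embedding.
class_embedding, distill_embedding = out[0], out[1]
class_logits = class_embedding @ w_class
distill_logits = distill_embedding @ w_distill

# Baseline for comparison: global average pooling over the frame outputs.
avg_embedding = out[2:].mean(axis=0)

# The class token targets the true label; the distillation token targets
# the teacher's prediction (a random stand-in for a trained teacher here).
true_label = 2
teacher_logits = rng.normal(size=C)
teacher_label = int(np.argmax(teacher_logits))

# Assumed loss: equal weighting of the two cross-entropy terms.
loss = 0.5 * cross_entropy(class_logits, true_label) \
     + 0.5 * cross_entropy(distill_logits, teacher_label)
```

In this sketch the class-token embedding is a learned, attention-weighted combination of the frames, whereas the average-pooling baseline weights every frame equally. At test time, one plausible choice is to use `class_embedding` (optionally combined with `distill_embedding`) as the utterance representation, but the abstract does not specify this.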