State-of-the-art speaker verification frameworks have typically improved verification performance by making models deeper (more layers) and wider (more channels). This paper instead proposes increasing the model's resolution capability by using attention-based dynamic kernels in a convolutional neural network, so that the model parameters are conditioned on the input features. The attention weights on the kernels are further distilled by channel attention and multi-layer feature aggregation to learn global features from speech. Because the model parameters adapt themselves to each input, this approach offers an efficient way to improve representation capacity with fewer data resources. The proposed dynamic convolutional model achieves 1.62\% EER and 0.18 miniDCF on the VoxCeleb1 test set, a 17\% relative improvement over ECAPA-TDNN trained with the same resources.
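To make the idea of an attention-based dynamic kernel concrete, the following is a minimal sketch in plain Python, not the model evaluated here. It assumes a common formulation of dynamic convolution: K candidate kernels are mixed by softmax attention weights computed from the time-averaged input, and the mixed kernel is then applied as an ordinary 1-D convolution. The function names, the single linear attention layer, and the temperature parameter are illustrative assumptions; channel attention and multi-layer feature aggregation are omitted.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kernel_attention(x, attn_weights, temperature=1.0):
    """Return one attention weight per candidate kernel.

    x            : C_in lists of frames (channels x time)
    attn_weights : K x C_in matrix (hypothetical single linear layer)
    """
    # Global average pooling over time gives one summary value per channel.
    pooled = [sum(channel) / len(channel) for channel in x]
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in attn_weights]
    return softmax(scores, temperature)


def dynamic_conv1d(x, kernels, attn_weights, temperature=1.0):
    """Apply an input-conditioned 1-D convolution (no padding, stride 1).

    kernels : K x C_out x C_in x width nested lists of candidate kernels
    Returns (output of shape C_out x frames, attention weights).
    """
    pi = kernel_attention(x, attn_weights, temperature)
    n_kernels = len(kernels)
    n_out = len(kernels[0])
    n_in = len(x)
    width = len(kernels[0][0][0])

    # Mix the K candidate kernels into a single kernel for this input.
    mixed = [[[sum(pi[k] * kernels[k][o][i][t] for k in range(n_kernels))
               for t in range(width)]
              for i in range(n_in)]
             for o in range(n_out)]

    # Ordinary convolution with the mixed kernel.
    n_frames = len(x[0]) - width + 1
    out = [[sum(mixed[o][i][t] * x[i][f + t]
                for i in range(n_in) for t in range(width))
            for f in range(n_frames)]
           for o in range(n_out)]
    return out, pi


if __name__ == "__main__":
    # Toy input: 2 channels, 4 frames; 2 candidate kernels of width 2.
    features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
    candidates = [[[[1.0, 0.0], [0.0, 0.0]]],
                  [[[0.0, 1.0], [0.0, 0.0]]]]
    attention_layer = [[0.5, -0.5], [-0.5, 0.5]]
    output, weights = dynamic_conv1d(features, candidates, attention_layer)
    print("attention weights:", weights)
    print("output:", output)
```

Because the kernels are mixed before the convolution is applied, the per-input cost stays close to that of a single static convolution, while the effective kernel changes with each utterance.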