Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% equal error rate (EER) on the VoxCeleb1 test set, outperforming the previously proposed Transformer-based models and CNN-based models such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve a 14.6% relative reduction in EER over the Res2Net50 model.
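The abstract reports results as EER and as a relative reduction in EER. As a rough illustration of what these two quantities mean, the sketch below computes an approximate EER from verification scores and a relative reduction between two error rates. It is a minimal sketch, not the authors' evaluation code: the function names `equal_error_rate` and `relative_reduction`, the threshold sweep, and all scores and rates used with it are hypothetical and chosen only for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER in [0, 1] by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR = fraction of target (same-speaker) trials rejected.
    FAR = fraction of non-target (different-speaker) trials accepted.
    The EER is taken at the threshold where |FAR - FRR| is smallest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical similarity scores, made up for this example only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))

# Hypothetical error rates (in percent), not taken from the paper.
print(relative_reduction(2.0, 1.5))
```

A relative reduction compares the improved error rate to the baseline error rate, so it is not the same as the absolute difference between the two rates. Production toolkits usually interpolate between thresholds rather than picking the closest crossing point, which matters when the number of trials is small.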

