Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we strengthen the Transformer's locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% equal error rate (EER) on the VoxCeleb1 test set, outperforming the previously proposed Transformer-based models and CNN-based models, such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results with a 14.6% relative reduction in EER over the Res2Net50 model.
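For readers unfamiliar with the reported metrics, the following is a minimal sketch of how an EER and a relative EER reduction can be computed from verification trial scores. It is not the evaluation code used in this study: the function names `equal_error_rate` and `relative_reduction` and all scores and EER values below are hypothetical, chosen only for illustration, and the threshold sweep is a simple approximation rather than an interpolated ROC-based estimate.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative improvement of new_eer over baseline_eer, as a fraction."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical similarity scores, not taken from any dataset in this study.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)

# Hypothetical EER values (in percent) for a baseline and an improved model.
toy_reduction = relative_reduction(2.0, 1.5)
```

A relative reduction is measured against the baseline's own EER, so it says nothing about the absolute error rates involved; the 14.6% figure above should be read in that sense.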