Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition

Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, variable-length sequences that carry diverse information at each time-frequency (TF) location. A standard convolutional layer operates only on neighboring local regions and therefore often fails to capture complex global TF information. Our motivation is to alleviate these challenges by increasing the modeling capacity, emphasizing significant information, and suppressing possible redundancies. We aim to design a more robust and efficient speaker recognition system by combining the benefits of attention mechanisms and Discrete Cosine Transform (DCT) based signal processing techniques to effectively represent the global information in speech signals. To achieve this, we propose a general global time-frequency context modeling block for speaker modeling. First, an attention-based context model is introduced to capture long-range, non-local relationships across different time-frequency locations. Second, a 2D-DCT based context model is proposed to improve model efficiency and examine the benefits of signal modeling. A multi-DCT attention mechanism is presented to improve modeling power with alternate DCT base forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. This recalibration improves speaker verification performance over the standard ResNet model and the Squeeze & Excitation block by a large margin. Our experimental results show that the proposed global context modeling method can efficiently improve the learned speaker representations by achieving channel-wise and time-frequency feature recalibration.
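
The three ideas named in the abstract (an attention-pooled global context, a 2D-DCT based global context, and recalibration by similarity between the context and local features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a feature map of shape (channels, frequency, time), a DCT-II style cosine basis, a dot-product similarity scaled by the square root of the channel count, and a sigmoid gate. The function names `dct_basis`, `dct_context`, `attention_context`, and `recalibrate`, and the projection vector `w`, are all hypothetical.

```python
import numpy as np


def dct_basis(n_freq, n_time, u, v):
    """2D cosine basis (DCT-II style, unnormalized) with frequency indices (u, v)."""
    f = np.arange(n_freq)[:, None]
    t = np.arange(n_time)[None, :]
    return (np.cos(np.pi * (f + 0.5) * u / n_freq)
            * np.cos(np.pi * (t + 0.5) * v / n_time))


def dct_context(x, u=0, v=0):
    """Project each channel of x (C, F, T) onto one 2D-DCT basis -> context (C,).

    With u = v = 0 the basis is all ones, so this reduces to global average
    pooling, which is the squeeze step of a Squeeze & Excitation block.
    """
    _, n_freq, n_time = x.shape
    basis = dct_basis(n_freq, n_time, u, v)
    return (x * basis).sum(axis=(1, 2)) / (n_freq * n_time)


def attention_context(x, w):
    """Attention-pooled context: softmax weights over all TF locations.

    w is an assumed (C,) projection vector that would be learned in practice.
    """
    c = x.shape[0]
    flat = x.reshape(c, -1)            # (C, F*T)
    scores = w @ flat                  # one score per TF location
    scores = scores - scores.max()     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return flat @ weights              # (C,)


def recalibrate(x, context):
    """Gate each TF location by its similarity to the global context (assumed form)."""
    c = x.shape[0]
    sim = np.tensordot(context, x, axes=(0, 0)) / np.sqrt(c)   # (F, T)
    gate = 1.0 / (1.0 + np.exp(-sim))                          # sigmoid in (0, 1)
    return x * gate[None, :, :]


# Toy usage on random features: 4 channels, 8 frequency bins, 10 frames.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 10))
out_dct = recalibrate(x, dct_context(x, u=1, v=2))
out_att = recalibrate(x, attention_context(x, w=rng.standard_normal(4)))
```

In a trained network the projection `w` and the choice of DCT indices would be learned or selected per layer; here they are fixed only to keep the sketch self-contained.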

