Learning an effective speaker representation is crucial for reliable speaker verification. Speech signals are long, high-dimensional, variable-length sequences that carry diverse information at each time-frequency (TF) location. A standard convolutional layer operates only on neighboring local regions and therefore often fails to capture global TF information. Our motivation is to address these limitations by increasing modeling capacity, emphasizing significant information, and suppressing possible redundancies. We aim to design a more robust and efficient speaker recognition system that combines attention mechanisms with Discrete Cosine Transform (DCT) based signal processing to represent the global information in speech signals effectively. To this end, we propose a general global time-frequency context modeling block for speaker modeling. First, an attention-based context model is introduced to capture long-range, non-local relationships across different time-frequency locations. Second, a 2D-DCT based context model is proposed to improve model efficiency and to examine the benefits of signal modeling. In addition, a multi-DCT attention mechanism is presented to improve modeling power with alternate DCT basis functions. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed block improves speaker verification performance by a large margin over both the standard ResNet model and the Squeeze & Excitation block. Our experimental results show that the proposed global context modeling method effectively improves the learned speaker representations through channel-wise and time-frequency feature recalibration.
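The DCT-based context and the similarity-based recalibration described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`dct_basis`, `dct_global_context`, `recalibrate`), the sigmoid gate, the dot-product similarity, and the scaling by the square root of the channel count are all assumptions chosen for illustration. The feature map is assumed to have shape (channels, frequency, time).

```python
import numpy as np


def dct_basis(n, k):
    """k-th DCT-II cosine basis vector of length n (unnormalized)."""
    idx = np.arange(n)
    return np.cos(np.pi * (idx + 0.5) * k / n)


def dct_global_context(x, u, v):
    """Summarize each channel of x (C, F, T) with one 2D-DCT basis function.

    (u, v) select the frequency-axis and time-axis DCT indices.
    Returns a context vector of shape (C,).
    """
    _, n_freq, n_time = x.shape
    # Separable 2D basis: outer product of two 1D cosine bases, shape (F, T).
    basis = np.outer(dct_basis(n_freq, u), dct_basis(n_time, v))
    return (x * basis).sum(axis=(1, 2)) / (n_freq * n_time)


def recalibrate(x, context):
    """Rescale every TF location by its similarity to the global context.

    Assumption: similarity is a scaled dot product over channels,
    squashed to (0, 1) with a sigmoid gate.
    """
    n_ch = x.shape[0]
    sim = np.einsum("c,cft->ft", context, x) / np.sqrt(n_ch)  # (F, T)
    gate = 1.0 / (1.0 + np.exp(-sim))
    return x * gate[None, :, :]


# Hypothetical feature map: 4 channels, 5 frequency bins, 7 frames.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 7))
ctx = dct_global_context(x, 0, 0)
y = recalibrate(x, ctx)
```

With `u = v = 0` the basis is constant, so the context reduces to global average pooling, the pooling used in a Squeeze & Excitation block. Other `(u, v)` pairs weight the map with different cosine patterns, which is one way to read the "alternate DCT basis functions" mentioned above. Because the basis is built from the input shape, the sketch also accepts inputs with a variable number of frames.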