Speaker verification is essentially the task of recognizing unknown speakers in an open set. The objective is to learn embeddings that condense an utterance into a compact utterance-level representation, with small distances between utterances of the same speaker and large distances between utterances of different speakers. Self-attention and convolution are both widely used in speaker verification, but both suffer from high computational complexity. To overcome the Transformer's weakness in extracting local features and the computational cost of multilayer convolution, we introduce the Memory-Attention framework. This framework incorporates a deep feed-forward sequential memory network (DFSMN) into the self-attention mechanism, capturing long-term context by stacking multiple layers while strengthening the modeling of local dependencies. Building on this framework, we design a novel model called VOT, which uses a parallel variable-weight summation structure and an attention-based statistical pooling layer. To address the hard sample mining problem, we extend the AM-Softmax loss function and propose a new loss function named AM-Softmax-Focal. Experimental results on the VoxCeleb1 dataset show a significant improvement in system performance and surpass the majority of mainstream models, validating the importance of local information in the speaker verification task. The code will be available on GitHub.
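The abstract names the AM-Softmax-Focal loss but does not give its formula. The following is a minimal sketch of one plausible reading, assuming the loss applies the standard focal modulating factor (1 - p_t)^gamma to the AM-Softmax probability of the target class. The function name `am_softmax_focal` and the default values of the scale `s`, margin `m`, and focusing parameter `gamma` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def am_softmax_focal(embeddings, weights, labels, s=30.0, m=0.35, gamma=2.0):
    """Illustrative AM-Softmax loss with a focal modulating factor.

    embeddings: (batch, dim) speaker embeddings
    weights:    (dim, classes) classifier weights
    labels:     (batch,) integer speaker labels
    s, m, gamma are assumed hyperparameters, not values from the paper.
    """
    # L2-normalize embeddings and class weights so the logits are cosines.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w  # (batch, classes)

    # AM-Softmax: subtract the additive margin m from the target cosine only.
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)

    # Numerically stable log-softmax over classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Focal term: down-weight easy samples (p_t near 1) so that hard
    # samples contribute relatively more to the mean loss.
    log_pt = log_prob[idx, labels]
    pt = np.exp(log_pt)
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))
```

Under this assumption, setting `gamma=0` recovers plain AM-Softmax, and any positive `gamma` can only shrink each sample's contribution, with the largest reduction on samples the model already classifies confidently.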