VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

Speaker verification is essentially the process of identifying unknown speakers within an 'open set'. Our objective is to create optimal embeddings that condense information into concise speech-level representations, ensuring short distances within the same speaker and long distances between different speakers. Despite the prevalence of self-attention and convolution methods in speaker verification, they grapple with the challenge of high computational complexity. To overcome the Transformer's limitations in extracting local features and the computational cost of multilayer convolution, we introduce the Memory-Attention framework. This framework incorporates a deep feed-forward temporal memory network (DFSMN) into the self-attention mechanism, capturing long-term context by stacking multiple layers and enhancing the modeling of local dependencies. Building upon this, we design a novel model called VOT, utilizing a parallel variable weight summation structure and introducing an attention-based statistical pooling layer. To address the hard sample mining problem, we enhance the AM-Softmax loss function and propose a new loss function named AM-Softmax-Focal. Experimental results on the VoxCeleb1 dataset not only showcase a significant improvement in system performance but also surpass the majority of mainstream models, validating the importance of local information in the speaker verification task. The code will be available on GitHub.
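
The abstract names AM-Softmax-Focal but does not give its formula. One plausible reading, sketched here as an assumption and not as the authors' definition, is to take the standard AM-Softmax probability of the true speaker and scale the loss by a focal factor (1 - p)^gamma, so that easy samples contribute little and hard samples dominate. The function name `am_softmax_focal` and the default values of `scale`, `margin`, and `gamma` are illustrative choices, not values taken from the paper.

```python
import math


def am_softmax_focal(cosines, target, scale=30.0, margin=0.2, gamma=2.0):
    """Hypothetical AM-Softmax loss with focal weighting, for one utterance.

    cosines: cosine similarity between the utterance embedding and each
             speaker's class weight vector (one value per speaker).
    target:  index of the true speaker.
    scale, margin, gamma: illustrative hyperparameters (assumptions).
    """
    # AM-Softmax: subtract an additive margin from the target cosine only,
    # then scale every logit.
    logits = [
        scale * (c - margin if i == target else c)
        for i, c in enumerate(cosines)
    ]

    # Numerically stable log-softmax for the target class.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_p = logits[target] - log_z
    p = math.exp(log_p)

    # Focal modulation (assumed form): the factor shrinks toward 0 as the
    # target probability p approaches 1, down-weighting easy samples.
    return -((1.0 - p) ** gamma) * log_p


# An easy sample (target cosine far above the others) versus a hard one.
easy = am_softmax_focal([0.9, 0.1, 0.0], target=0)
hard = am_softmax_focal([0.3, 0.25, 0.1], target=0)
print(easy < hard)
```

With `gamma=0` the focal factor is 1 and the function reduces to plain AM-Softmax, which makes it easy to compare the two losses on the same inputs.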
