Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

From arXiv. Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Copyright may be transferred without notice, after which this version may no longer be accessible.

Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. These ResNet models treat the time and frequency dimensions equally: they follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram and conduct a systematic study of the impact of temporal and frequency resolutions on performance. We further identify two optimal points, namely Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on the VoxCeleb, SITW, and CNCeleb datasets, with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
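To make the notion of a stride configuration concrete, the following is a minimal sketch of how per-stage strides determine the frequency and temporal resolution of the final feature map. It rests on several assumptions not stated in the abstract: a four-stage 2D ResNet whose stages downsample with 3x3 convolutions (padding 1), an 80-bin by 200-frame input, and a symmetric default of stride 2 in three of the four stages. The asymmetric configuration is purely hypothetical and is not one of the Golden Gemini operating points, which the abstract does not specify.

```python
from math import prod

# Each stage is a (frequency_stride, time_stride) pair.
# Assumed symmetric default: stages 2-4 halve both axes.
SYMMETRIC = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric configuration, for illustration only.
# It is NOT taken from the paper.
HYPOTHETICAL_ASYMMETRIC = [(1, 1), (2, 1), (2, 2), (2, 1)]


def downsample_factors(stage_strides):
    """Return the total (frequency, time) downsampling factors."""
    freq = prod(sf for sf, _ in stage_strides)
    time = prod(st for _, st in stage_strides)
    return freq, time


def output_shape(n_freq, n_frames, stage_strides):
    """Return the (frequency, time) size of the final feature map.

    Assumes each stage downsamples with a 3x3 convolution and padding 1,
    for which the output length along one axis is ceil(n / stride).
    """
    f, t = n_freq, n_frames
    for sf, st in stage_strides:
        f = -(-f // sf)  # ceiling division
        t = -(-t // st)
    return f, t


if __name__ == "__main__":
    for name, cfg in [("symmetric", SYMMETRIC),
                      ("hypothetical asymmetric", HYPOTHETICAL_ASYMMETRIC)]:
        print(name, downsample_factors(cfg), output_shape(80, 200, cfg))
```

Under these assumptions, both configurations reduce the frequency axis by the same factor, but the asymmetric one keeps four times as many frames along the time axis. This kind of trade-off between temporal and frequency resolution is what a search over the stride space explores.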

