Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. These ResNet models treat the time and frequency dimensions equally: they follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram and conduct a systematic study of the impact of temporal and frequency resolutions on performance. From this study we identify two optimal points, namely Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based speaker verification models. By following this principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on the VoxCeleb, SITW, and CNCeleb datasets, with average reductions of 7.70% in EER and 11.76% in minDCF across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
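To make the notion of a stride configuration concrete, the following minimal sketch computes the frequency and temporal resolution that remain after a stack of strided stages. It is illustrative only: the function name `output_resolution`, the four-stage layout, the 80 x 200 input size, and both stride lists are assumptions, and the asymmetric list is a hypothetical example, not the Golden Gemini configuration reported in the paper.

```python
from math import ceil


def output_resolution(n_freq, n_frames, stage_strides):
    """Return (freq_bins, frames) left after applying each stage's stride.

    Assumes every stage downsamples with a 3x3 convolution and padding 1,
    for which the output length along an axis is ceil(input / stride).
    `stage_strides` is a list of (freq_stride, time_stride) pairs.
    """
    f, t = n_freq, n_frames
    for freq_stride, time_stride in stage_strides:
        f = ceil(f / freq_stride)
        t = ceil(t / time_stride)
    return f, t


# Image-style configuration: both axes are downsampled identically (assumed layout).
symmetric = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric configuration that keeps more temporal resolution.
# This is NOT the paper's Golden Gemini setting, only an illustration.
asymmetric = [(1, 1), (2, 1), (2, 2), (2, 1)]

# Assumed input: 80 frequency bins by 200 frames.
print(output_resolution(80, 200, symmetric))   # (10, 25)
print(output_resolution(80, 200, asymmetric))  # (10, 100)
```

Under these assumptions, the two configurations end at the same frequency resolution but differ by a factor of four in temporal resolution, which is the kind of trade-off a systematic search over the stride space would compare.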

