Previous studies have demonstrated the impressive performance of residual neural networks (ResNet) in speaker verification. These ResNet models treat the time and frequency dimensions equally: they follow the default stride configuration designed for image recognition, where the horizontal and vertical axes are similar. This approach ignores the fact that time and frequency are asymmetric in speech representations. In this paper, we address this issue and search for optimal stride configurations tailored to speaker verification. We represent the stride space on a trellis diagram and conduct a systematic study of the impact of temporal and frequency resolutions on performance. From this study we identify two optimal operating points, named Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based speaker verification models. Following this principle, a state-of-the-art ResNet baseline gains significant improvements on the VoxCeleb, SITW, and CNCeleb datasets, with average EER/minDCF reductions of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation shows that the proposed Golden Gemini operating points remain effective across various training conditions and architectures. Finally, we present a new benchmark, Gemini DF-ResNet, built on a cutting-edge model.
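To make the notion of a stride configuration concrete, the following minimal sketch tracks how per-stage strides determine the frequency and temporal resolution of a 2D feature map. The function name, the 80 x 200 input size, the symmetric 1, 2, 2, 2 stage strides, and the asymmetric alternative are all assumptions chosen for illustration; in particular, the asymmetric configuration is hypothetical and is not the Golden Gemini setting reported in the paper.

```python
from math import ceil


def resolution_after_strides(n_freq, n_frames, strides):
    """Return the (freq_bins, frames) shape after each stage.

    `strides` is a list of (freq_stride, time_stride) pairs, one per stage.
    """
    shapes = []
    f, t = n_freq, n_frames
    for sf, st in strides:
        # A 3x3 convolution with padding 1 and stride s maps length n to ceil(n / s).
        f, t = ceil(f / sf), ceil(t / st)
        shapes.append((f, t))
    return shapes


# Assumed image-style configuration: identical strides on both axes at every stage.
symmetric = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric configuration that downsamples frequency more than time.
# Illustrative only; NOT the Golden Gemini configuration from the paper.
asymmetric = [(2, 1), (2, 2), (2, 1), (2, 2)]

N_MELS, N_FRAMES = 80, 200  # assumed input: 80 mel bins, 200 frames

sym_shapes = resolution_after_strides(N_MELS, N_FRAMES, symmetric)
asym_shapes = resolution_after_strides(N_MELS, N_FRAMES, asymmetric)

for name, shapes in (("symmetric", sym_shapes), ("asymmetric", asym_shapes)):
    print(name, shapes)
```

Each sequence of stride pairs can be read as one path through the stride space; comparing the final shapes shows how two networks of the same depth can end with very different temporal and frequency resolutions.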