Speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we evaluate ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure their energy consumption and carbon footprint using node-level sensors. The results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.