Recent speaker verification (SV) systems have shown a trend toward adopting deeper speaker embedding extractors. Although deeper and larger neural networks can significantly improve performance, their substantial memory requirements hinder training on consumer GPUs. In this paper, we explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios. First, we conduct a systematic analysis of GPU memory allocation during SV system training. Empirical observations show that activations and optimizer states are the main sources of memory consumption. For activations, we design two types of reversible neural networks that eliminate the need to store intermediate activations during back-propagation, thereby significantly reducing memory usage without performance loss. For optimizer states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type. Experimental results on VoxCeleb demonstrate that the reversible variants of ResNets and DF-ResNets can be trained without caching activations in GPU memory. In addition, the 8-bit versions of SGD and Adam save 75% of memory costs while maintaining performance compared to their 32-bit counterparts. Finally, a detailed comparison of memory usage and performance indicates that our proposed models achieve up to 16.2x memory savings, with nearly identical parameter counts and performance compared to the vanilla systems. Whereas previous systems required multiple high-end GPUs such as the A100, we can effectively train deep speaker embedding extractors with just one or two consumer-level 2080Ti GPUs.
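The activation-side idea can be illustrated with a minimal RevNet-style sketch: because the block's input is exactly recoverable from its output, intermediate activations need not be cached for back-propagation and can instead be recomputed on the fly. The sub-functions `f` and `g` below are illustrative stand-ins, not the paper's actual residual sub-networks.

```python
import numpy as np

def f(x):
    # stand-in for the first residual sub-network of a reversible block
    return np.tanh(x)

def g(x):
    # stand-in for the second residual sub-network
    return 0.5 * x

def rev_forward(x1, x2):
    # additive coupling: the output pair determines the input pair exactly
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # invert the coupling in reverse order, so no activations need caching
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1 = np.array([0.3, -1.2])
x2 = np.array([0.7, 0.1])
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

During back-propagation each block reconstructs its own input from its output, trading a small amount of recomputation for activation memory that no longer grows with network depth.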
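The optimizer-state idea can likewise be sketched: each 32-bit state tensor is normalized by its absolute maximum and mapped to the nearest entry of a 256-value codebook, then dequantized just before the update step. The paper uses a dynamic tree-based 8-bit data type; the simple linear codebook below is only a hedged stand-in to show the quantize/dequantize round trip and the 4x (75%) storage reduction.

```python
import numpy as np

# 256-entry codebook over [-1, 1]; the paper's dynamic-tree type allocates
# these levels non-uniformly, but a linear grid illustrates the mechanism.
CODEBOOK = np.linspace(-1.0, 1.0, 256).astype(np.float32)

def quantize(state):
    # normalize by absmax, then snap each value to its nearest codebook entry
    absmax = max(float(np.abs(state).max()), 1e-12)
    normed = state / absmax
    idx = np.abs(normed[:, None] - CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax   # 1 byte per value instead of 4

def dequantize(idx, absmax):
    # look up the codebook value and undo the normalization
    return CODEBOOK[idx] * absmax

# hypothetical optimizer state (e.g. a slice of Adam's first moment)
m = np.array([0.02, -1.5, 0.8, 3.0], dtype=np.float32)
idx, scale = quantize(m)
m_hat = dequantize(idx, scale)
assert idx.dtype == np.uint8
assert np.allclose(m, m_hat, atol=scale * 2 / 255)
```

In an 8-bit SGD or Adam, the states live in this compressed form between steps and are only briefly materialized in 32 bits for the update, which is where the reported 75% saving on optimizer memory comes from.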