The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and runs at a speed competitive with the default batched computation. As a highlight, we compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds using only 6 GB of memory.
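To give a feel for why memory becomes the bottleneck, the following back-of-the-envelope sketch estimates the size of the joint network output, which in a typical transducer has shape (batch, frames, labels + 1, vocabulary). The frame count, label count, vocabulary size, and 32-bit precision are illustrative assumptions and not values taken from this work; only the batch size of 1024 and the 40-second audio length come from the text above. The sketch counts a single tensor and ignores gradients and all other activations, so it is not an estimate of total training memory.

```python
def joint_tensor_bytes(batch_size, num_frames, num_labels, vocab_size,
                       bytes_per_value=4):
    """Bytes needed to hold joint outputs of shape (B, T, U + 1, V)."""
    return batch_size * num_frames * (num_labels + 1) * vocab_size * bytes_per_value


# Batch size from the abstract; everything below it is a hypothetical choice.
BATCH_SIZE = 1024
NUM_FRAMES = 1000   # assumed: 40 s of audio at one encoder frame per 40 ms
NUM_LABELS = 99     # assumed label sequence length, so U + 1 = 100
VOCAB_SIZE = 1024   # assumed output vocabulary size

# Batched computation materializes the tensor for all samples at once.
batched = joint_tensor_bytes(BATCH_SIZE, NUM_FRAMES, NUM_LABELS, VOCAB_SIZE)

# Processing one sample at a time only needs one sample's slice at any moment.
sample_wise = joint_tensor_bytes(1, NUM_FRAMES, NUM_LABELS, VOCAB_SIZE)

print(f"batched joint tensor:     {batched / 2**30:.1f} GiB")
print(f"one sample's joint tensor: {sample_wise / 2**20:.1f} MiB")
```

Under these assumptions the batched tensor is larger than a single sample's slice by exactly the batch size, which is the gap a sample-by-sample computation can exploit.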