Micro-batch clipping, a gradient clipping method, has recently shown potential for enhancing automatic speech recognition (ASR) model performance. However, the underlying mechanism behind this improvement remains unclear, particularly the observation that only certain micro-batch sizes are beneficial. In this paper, we make a first attempt to explain this phenomenon. Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases. Under this assumption, our convergence analysis shows that micro-batch clipping can improve the convergence rate asymptotically, at the cost of an additional constant bias that does not diminish with more training iterations. The bias depends on a few factors and is minimized at a specific micro-batch size, thereby explaining the existence of the sweet-spot micro-batch size observed previously. We also verify the effectiveness of micro-batch clipping beyond speech models, on vision and language models, and show promising performance gains in these domains. An exploration of potential limitations shows that micro-batch clipping is less effective when training data originate from multiple distinct domains.