The conventional recipe for Automatic Speech Recognition (ASR) models is to 1) train multiple checkpoints on a training set, relying on a validation set and early stopping to prevent overfitting, and 2) average the last several checkpoints, or those with the lowest validation losses, to obtain the final model. In this paper, we revisit and update early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff. In theory, bias and variance measure a model's fit and variability, respectively, and their tradeoff determines the overall generalization error. However, they are impractical to evaluate precisely. As an alternative, we take the training loss and validation loss as proxies for bias and variance, and use their tradeoff, which we call the Approximated Bias-Variance Tradeoff (ApproBiVT), to guide early stopping and checkpoint averaging. When evaluated with advanced ASR models, our recipe yields 2.5%-3.7% and 3.1%-4.6% character error rate (CER) reductions on AISHELL-1 and AISHELL-2, respectively.
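The two-step recipe (early stopping, then checkpoint averaging) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each checkpoint records its training and validation loss, and it combines them with a plain sum in a hypothetical `tradeoff_score` as a stand-in for the ApproBiVT criterion, whose exact form is not specified here. Checkpoints are plain Python dicts of flattened weights rather than framework tensors; `Checkpoint`, `select_checkpoints`, `average_params`, and `should_stop` are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Checkpoint:
    epoch: int
    train_loss: float  # proxy for bias
    val_loss: float    # proxy for variance
    params: Dict[str, List[float]]  # parameter name -> flattened weights


def tradeoff_score(ckpt: Checkpoint) -> float:
    # Hypothetical criterion (assumption): lower is better, combining the
    # bias proxy and the variance proxy by a simple sum.
    return ckpt.train_loss + ckpt.val_loss


def should_stop(history: List[Checkpoint], patience: int) -> bool:
    # Early stopping: stop once the best score has not improved
    # for `patience` consecutive epochs.
    if len(history) <= patience:
        return False
    best_idx = min(range(len(history)), key=lambda i: tradeoff_score(history[i]))
    return len(history) - 1 - best_idx >= patience


def select_checkpoints(ckpts: List[Checkpoint], k: int) -> List[Checkpoint]:
    # Keep the k checkpoints with the best (lowest) tradeoff score.
    return sorted(ckpts, key=tradeoff_score)[:k]


def average_params(ckpts: List[Checkpoint]) -> Dict[str, List[float]]:
    # Element-wise mean of every parameter across the selected checkpoints.
    if not ckpts:
        raise ValueError("need at least one checkpoint")
    n = len(ckpts)
    averaged: Dict[str, List[float]] = {}
    for name in ckpts[0].params:
        vectors = [c.params[name] for c in ckpts]
        averaged[name] = [sum(vals) / n for vals in zip(*vectors)]
    return averaged


# Usage with made-up losses and weights (illustrative only).
history = [
    Checkpoint(1, 2.0, 2.2, {"w": [1.0, 2.0]}),
    Checkpoint(2, 1.5, 1.9, {"w": [3.0, 4.0]}),
    Checkpoint(3, 1.2, 1.8, {"w": [5.0, 6.0]}),
    Checkpoint(4, 1.0, 2.1, {"w": [7.0, 8.0]}),
]
best = select_checkpoints(history, k=2)
final_params = average_params(best)
print([c.epoch for c in best], final_params)
```

Here the last epoch has the lowest training loss but a rising validation loss, so a criterion that weighs both proxies still ranks epoch 3 first; swapping in a different combination (e.g. a weighted sum) only requires changing `tradeoff_score`.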