Throughout the training of an ASR model, the strength of data augmentation and the way the training loss is computed are typically fixed by preset hyperparameters. For example, SpecAugment masks portions of the time-frequency spectrum with a predefined augmentation strength. Similarly, in CTC-based multi-layer models, the loss is usually computed only from the output of the encoder's final layer. Ignoring the dynamic characteristics of training samples in this way can lead to suboptimal training. To address this issue, we present a two-stage training method, complexity-boosted adaptive (CBA) training, which dynamically adjusts the data augmentation strategy and the CTC loss propagation according to the complexity of each training sample. In the first stage, we train the model with intermediate-CTC-based regularization and data augmentation, without any adaptive policy. In the second stage, we introduce a novel adaptive policy, MinMax-IBF, which estimates sample complexity, and combine it with data augmentation and intermediate CTC loss regularization to continue training. The proposed CBA training approach yields considerable improvements over the Conformer architecture in Wenet: up to 13.4% and 14.1% relative WER reduction on the LibriSpeech 100h test-clean and test-other sets, and up to 6.3% relative reduction on the AISHELL-1 test set.
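The abstract does not spell out how MinMax-IBF computes sample complexity, but the general idea of complexity-adaptive augmentation can be sketched. The following is a minimal, hypothetical illustration only: it assumes per-sample complexity is derived from a loss value (e.g. an intermediate CTC loss), min-max normalized over the batch, and then used to scale a SpecAugment-style mask width. The function names, the normalization window, and the linear scaling rule are all illustrative assumptions, not the paper's actual policy.

```python
def minmax_normalize(values):
    """Min-max normalize a batch of per-sample scores to [0, 1].

    Hypothetical complexity proxy: the scores could be per-sample
    intermediate CTC losses; higher score = harder sample.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # degenerate batch: treat all as average
    return [(v - lo) / (hi - lo) for v in values]


def adaptive_mask_width(complexity, base_width=27, min_width=5):
    """Map normalized complexity to a SpecAugment mask width (assumption).

    Easy samples (complexity near 0) receive the full mask width;
    hard samples (complexity near 1) are masked less, so the model
    can still extract signal from them. base_width=27 mirrors a
    common SpecAugment frequency-mask setting, but is illustrative.
    """
    return int(min_width + (1.0 - complexity) * (base_width - min_width))


# Example: per-sample intermediate losses for one batch (made-up numbers)
batch_losses = [0.8, 1.2, 2.0, 1.6]
complexities = minmax_normalize(batch_losses)
mask_widths = [adaptive_mask_width(c) for c in complexities]
```

In this sketch, the easiest sample in the batch gets the strongest masking and the hardest gets the weakest, which is one plausible reading of "adjusting data augmentation based on sample complexity"; the paper's second stage additionally adapts how the intermediate CTC loss is propagated, which is not modeled here.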