Joint training of connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models, which calculate the CTC and AED losses separately, our proposed integrated-CTC uses the attention mechanism of the AED to guide the output of CTC. We employ two fusion methods: direct addition of logits (DAL) and preserving the maximum probability (PMP). Dimensional consistency is achieved by applying an adaptive affine transformation to the attention results so that they match the dimensions of the CTC output. To accelerate convergence and improve accuracy, we introduce auxiliary loss regularization. Experimental results show that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
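As a rough illustration of the DAL idea, the sketch below projects attention outputs to the CTC vocabulary size with an affine map and adds them to the CTC logits before normalization. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the exact formulation, so the function names (`affine`, `fuse_dal`), the assumption that the attention outputs are already aligned to the same T frames as the CTC logits, and the fixed (rather than learned) weights are all hypothetical. PMP is not sketched here because the abstract does not specify how the maximum probability is preserved.

```python
import math


def affine(x, weight, bias):
    """Affine map of each row of x (T x D) to V dimensions: x @ weight + bias.

    weight is a D x V list of rows, bias has length V (both hypothetical
    parameters; in a real model they would be learned).
    """
    columns = list(zip(*weight))  # V columns, each of length D
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) + b
         for col, b in zip(columns, bias)]
        for row in x
    ]


def log_softmax(row):
    """Numerically stable log-softmax over one frame."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def fuse_dal(ctc_logits, attn_out, weight, bias):
    """Direct addition of logits (assumed form).

    ctc_logits: T x V frame-level CTC logits.
    attn_out:   T x D attention outputs, assumed frame-aligned with CTC.
    Returns T x V log-probabilities of the fused logits.
    """
    projected = affine(attn_out, weight, bias)  # T x V, now matches CTC
    fused = [
        [c + p for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(ctc_logits, projected)
    ]
    return [log_softmax(row) for row in fused]


if __name__ == "__main__":
    # Toy example: T=2 frames, V=3 output symbols, D=2 attention features.
    ctc = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
    attn = [[1.0, 0.0], [0.0, 1.0]]
    w = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
    b = [0.0, 0.0, 0.0]
    for frame in fuse_dal(ctc, attn, w, b):
        print([round(v, 3) for v in frame])
```

In a trainable system the fused log-probabilities would feed the CTC loss, so that gradients flow through both the CTC branch and the projected attention branch.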