Attention-based encoder-decoder (AED) models have shown impressive performance in automatic speech recognition (ASR). However, most existing AED methods do not leverage acoustic and semantic features simultaneously in the decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. Unlike vanilla decoders, which process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. We also propose a variant, Semi-ASCD, to balance accuracy and computational cost. We evaluate ASCD on the publicly available AISHELL-1 and aidatatang_200zh datasets with Transformer, Conformer, and Branchformer encoders. The experimental results show that ASCD significantly improves recognition performance by leveraging acoustic and semantic information cooperatively.
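The abstract does not define the Causal Multimodal Mask, so the following is only a minimal sketch of one plausible reading, not the authors' definition. It assumes the decoder attends over a single concatenated sequence of acoustic frames followed by semantic (text) tokens, that acoustic positions may see every acoustic frame but no text token, and that each text token may see all acoustic frames plus itself and earlier text tokens. The function name `causal_multimodal_mask`, its parameters, and the sequence layout are all hypothetical.

```python
import numpy as np


def causal_multimodal_mask(num_frames: int, num_tokens: int) -> np.ndarray:
    """Return a boolean attention mask; True means attention is allowed.

    Assumed layout (hypothetical, not taken from the paper):
    rows/columns 0..num_frames-1 are acoustic frames, followed by
    num_tokens semantic (text) token positions.
    """
    total = num_frames + num_tokens
    mask = np.zeros((total, total), dtype=bool)

    # Acoustic rows: full access to acoustic frames, no access to text
    # tokens, so future labels cannot leak through the acoustic states.
    mask[:num_frames, :num_frames] = True

    # Text rows: full access to all acoustic frames.
    mask[num_frames:, :num_frames] = True

    # Text rows: causal (lower-triangular) access to text tokens.
    mask[num_frames:, num_frames:] = np.tril(
        np.ones((num_tokens, num_tokens), dtype=bool)
    )
    return mask


if __name__ == "__main__":
    # 3 acoustic frames followed by 2 text tokens.
    print(causal_multimodal_mask(3, 2).astype(int))
```

Under these assumptions, the upper-right block (acoustic rows, text columns) and the strict upper triangle of the text block stay masked, which is what keeps a position from seeing tokens it has yet to predict during teacher-forced training.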