Transducer and Attention-based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes, and each has its own benefits and drawbacks. To leverage the strengths of both modeling methods, we propose a solution that combines Transducer and Attention-based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence-to-sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder of the AED model, so the decoder outputs are conditioned on the speech inputs rather than being the outputs of an unconditioned language model. The proposed solution ensures that the model is optimized over all possible read/write scenarios and creates a matched environment for streaming applications. We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than Transducer on offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks. In the streaming case, TAED outperforms Transducer on the ASR task and in one ST direction, while achieving comparable results in another translation direction.
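The shared-encoder design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single attention step standing in for the AED decoder, the sum-plus-tanh "joiner", and all names (`attend`, `taed_lattice`, `encoder_out`, `token_embeds`) are assumptions made for illustration. The sketch only shows the structural idea that the decoder state for each target prefix is recomputed for every number of speech frames read so far, so each cell of the frame-by-prefix lattice depends on the speech input, whereas a conventional Transducer predictor depends on the previous tokens alone.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames):
    """Scaled dot-product attention of one query vector over encoder frames."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    weights = softmax(scores)
    return [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(len(query))
    ]


def taed_lattice(encoder_out, token_embeds):
    """Return joint[t][u], a toy joiner input for each (frame t, prefix u) cell.

    Hypothetical simplification: the "decoder" is one attention step over the
    frames read so far, and the "joiner" is an element-wise sum plus tanh.
    """
    lattice = []
    for t in range(len(encoder_out)):
        visible = encoder_out[: t + 1]  # streaming: only frames read so far
        row = []
        for emb in token_embeds:
            context = attend(emb, visible)
            # Decoder state conditioned on speech, unlike a Transducer predictor.
            dec = [e + c for e, c in zip(emb, context)]
            row.append([math.tanh(h + d) for h, d in zip(encoder_out[t], dec)])
        lattice.append(row)
    return lattice


rng = random.Random(0)
T, U, D = 4, 3, 8  # frames, target prefixes (including start token), hidden size
encoder_out = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
token_embeds = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(U)]
joint = taed_lattice(encoder_out, token_embeds)
```

Because the decoder at frame `t` only attends to `encoder_out[: t + 1]`, changing a later frame leaves every earlier lattice row unchanged, which is the property a streaming model needs. A real system would feed the lattice to a Transducer loss and use a multi-layer decoder; those parts are omitted here.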