Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance on various tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU). Recently, a new encoder called E-Branchformer has outperformed Conformer on the LibriSpeech ASR benchmark, making it promising for more general speech applications. This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models. Results demonstrate that E-Branchformer achieves comparable or better performance than Conformer on almost all evaluation sets across 15 ASR, 2 ST, and 3 SLU benchmarks, while being more stable during training. We will release our training configurations and pre-trained models for reproducibility, which can benefit the speech community.