Transformers have achieved great success across a wide range of artificial intelligence tasks. Thanks to the recent prevalence of self-attention mechanisms, which capture long-range dependencies, they have produced phenomenal results in speech processing and recognition tasks. This paper presents a comprehensive survey of transformer techniques oriented to the speech modality. The main contents of this survey include: (1) the background of traditional ASR, the end-to-end transformer ecosystem, and speech transformers; (2) foundational speech models viewed through the lingualism paradigm, i.e., monolingual, bilingual, multilingual, and cross-lingual; (3) datasets and languages, acoustic features, architectures, decoding, and evaluation metrics from a topological lingualism perspective; (4) popular speech transformer toolkits for building end-to-end ASR systems. Finally, we highlight open challenges and potential research directions for the community to conduct further research in this domain.