Automatic Speech Recognition (ASR) has undergone a profound transformation over the past decade, driven by advances in deep learning. This survey provides a comprehensive overview of the modern era of ASR, charting its evolution from traditional hybrid systems, such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-HMMs (DNN-HMMs), to the now-dominant end-to-end neural architectures. We systematically review the foundational end-to-end paradigms: Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which laid the groundwork for fully integrated speech-to-text systems. We then detail the subsequent architectural shift toward Transformer and Conformer models, which leverage self-attention to capture long-range dependencies with high computational efficiency. A central theme of this survey is the parallel revolution in training paradigms. We examine the progression from fully supervised learning, augmented by techniques such as SpecAugment, to the rise of self-supervised learning (SSL) with foundation models such as wav2vec 2.0, which drastically reduce the reliance on transcribed data. Furthermore, we analyze the impact of large-scale, weakly supervised models like Whisper, which achieve unprecedented robustness through massive data diversity. The paper also covers essential ecosystem components, including key datasets and benchmarks (e.g., LibriSpeech, Switchboard, CHiME), standard evaluation metrics (e.g., Word Error Rate), and critical considerations for real-world deployment, such as streaming inference, on-device efficiency, and the ethical imperatives of fairness and robustness. We conclude by outlining open challenges and future research directions.