End-to-end automatic speech recognition (ASR) has become the dominant paradigm in both academia and industry. To enhance recognition performance, Weighted Finite-State Transducers (WFSTs) are widely adopted to integrate acoustic and language models through static graph composition, providing robust decoding and effective error correction. However, WFST decoding relies on a frame-by-frame autoregressive search over Connectionist Temporal Classification (CTC) posterior probabilities, which severely limits inference efficiency. Motivated by the goal of establishing a more principled compatibility between WFST decoding and CTC modeling, we systematically study the two fundamental components of CTC outputs, namely blank and non-blank frames, and identify a key insight: blank frames primarily encode positional information, while non-blank frames carry semantic content. Building on this observation, we introduce Keep-Only-One and Insert-Only-One, two decoding algorithms that explicitly exploit the structural roles of blank and non-blank frames to achieve significantly faster WFST-based inference without compromising recognition accuracy. Experiments on a large-scale in-house dataset, AISHELL-1, and LibriSpeech demonstrate state-of-the-art recognition accuracy with substantially reduced decoding latency, enabling efficient, high-performance WFST decoding in modern speech recognition systems.
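To make the two algorithm names concrete, the following is a minimal Python sketch of one plausible reading of them, inferred only from the names and the blank/non-blank framing above; the paper's actual definitions may differ. The assumed reading: Keep-Only-One collapses each run of consecutive blank frames to a single blank, while Insert-Only-One discards blank frames entirely and re-inserts exactly one blank between repeated labels so that the CTC collapse rule still distinguishes genuine repeats. A real decoder would select frames from the posterior matrix before the WFST search; this label-level version, with an assumed blank index of 0, only checks that both reductions preserve the collapsed output.

```python
BLANK = 0  # conventional CTC blank index (assumption)

def keep_only_one(frame_labels):
    """Collapse every run of consecutive blank frames to a single blank frame."""
    out = []
    for lab in frame_labels:
        if lab == BLANK and out and out[-1] == BLANK:
            continue  # skip redundant blanks inside a blank run
        out.append(lab)
    return out

def insert_only_one(frame_labels):
    """Drop all blanks, keeping one blank only between genuine repeats."""
    # Merge consecutive identical frames first: same label back-to-back
    # is a single emission under the CTC collapse rule.
    merged = [lab for i, lab in enumerate(frame_labels)
              if i == 0 or lab != frame_labels[i - 1]]
    non_blank = [lab for lab in merged if lab != BLANK]
    out = []
    for lab in non_blank:
        if out and out[-1] == lab:
            out.append(BLANK)  # separator so a repeat is not merged away
        out.append(lab)
    return out

def ctc_collapse(frame_labels):
    """Standard CTC rule: merge consecutive repeats, then remove blanks."""
    merged = [lab for i, lab in enumerate(frame_labels)
              if i == 0 or lab != frame_labels[i - 1]]
    return [lab for lab in merged if lab != BLANK]

if __name__ == "__main__":
    # Frame-level argmax labels; 0 is blank, repeated 12s span a true repeat.
    frames = [0, 8, 8, 0, 0, 5, 12, 12, 0, 0, 0, 12, 15, 0]
    for reduce_fn in (keep_only_one, insert_only_one):
        reduced = reduce_fn(frames)
        assert ctc_collapse(reduced) == ctc_collapse(frames)
        print(reduce_fn.__name__, len(frames), "->", len(reduced), reduced)
```

Under this reading, both reductions shorten the frame sequence fed to the frame-synchronous WFST search (14 frames down to 11 and 6 in the toy example) while leaving the decoded label sequence unchanged, which is consistent with the abstract's claim of faster inference without accuracy loss.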