Automatic speech recognition (ASR) systems based on Connectionist Temporal Classification (CTC) face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders can account for up to 90% of processing time in some systems (e.g., wav2vec2-large on L4 GPUs) because they perform exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that prunes tokens at the frame level, guided by a relative threshold probability. By dynamically eliminating low-probability tokens in each frame, FLToP CTC reduces compute and memory demands with negligible degradation in word error rate (WER). On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and a 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). By addressing these decoding bottlenecks, FLToP CTC scales to resource-limited environments and real-time applications, making speech recognition more accessible and efficient.
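The abstract does not spell out the pruning rule, so the following is only a minimal sketch of one plausible reading of "relative threshold probability": within each frame, a token survives if its probability is at least a fixed fraction of that frame's highest probability. The names `prune_frame`, `prune_utterance`, and `rel_threshold`, as well as the toy posteriors, are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def prune_frame(probs: List[float], rel_threshold: float) -> List[Tuple[int, float]]:
    """Return (token_id, prob) pairs that survive pruning in one frame.

    Assumed rule (not confirmed by the source): keep a token when
    prob >= rel_threshold * max(probs).
    """
    if not 0.0 < rel_threshold <= 1.0:
        raise ValueError("rel_threshold must be in (0, 1]")
    cutoff = rel_threshold * max(probs)
    return [(tok, p) for tok, p in enumerate(probs) if p >= cutoff]


def prune_utterance(frames: List[List[float]], rel_threshold: float) -> List[List[Tuple[int, float]]]:
    """Apply the per-frame pruning rule to every frame of an utterance."""
    return [prune_frame(frame, rel_threshold) for frame in frames]


# Toy posteriors: 3 frames over a 4-token vocabulary (token 0 plays the blank).
frames = [
    [0.90, 0.05, 0.04, 0.01],  # confident frame: most tokens are pruned
    [0.10, 0.85, 0.04, 0.01],
    [0.40, 0.35, 0.20, 0.05],  # ambiguous frame: more tokens survive
]
kept = prune_utterance(frames, rel_threshold=0.1)

# A downstream decoder would expand only the surviving tokens in each frame.
total = sum(len(frame) for frame in frames)
survivors = sum(len(frame) for frame in kept)
print(f"expanded {survivors} of {total} token candidates")
```

Under this assumed rule the number of surviving tokens adapts to each frame: confident frames keep very few candidates, while ambiguous frames keep more, which is where a decoder that only expands survivors would save work. Because the highest-probability token always survives, the per-frame argmax is never pruned.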