Single-word automatic speech recognition (ASR) is challenging because an isolated word carries no linguistic context and is sensitive to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and on a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. The results show that the hybrid ASR front end performs well on clean audio, while the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing the low latency required for real-time telephony applications.
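To make the edit-distance matching strategy concrete, the following is a minimal sketch of how a verification layer might map a noisy ASR hypothesis onto a closed vocabulary. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names `edit_distance` and `verify_word`, the length normalization, and the `max_norm_dist` threshold of 0.4 are all hypothetical choices.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def verify_word(
    hypothesis: str,
    vocabulary: Sequence[str],
    max_norm_dist: float = 0.4,  # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Return the closest vocabulary word, or None if nothing is close enough."""
    hyp = hypothesis.strip().lower()
    if not hyp:
        return None
    best_word: Optional[str] = None
    best_score = float("inf")
    for word in vocabulary:
        target = word.lower()
        # Normalize by the longer string so short and long words are comparable.
        score = edit_distance(hyp, target) / max(len(hyp), len(target))
        if score < best_score:
            best_word, best_score = word, score
    return best_word if best_score <= max_norm_dist else None
```

Under these assumptions, a slightly corrupted hypothesis such as `"stopp"` is mapped to `"stop"`, while a hypothesis far from every vocabulary entry is rejected with `None` rather than forced onto the nearest word, which is one simple way to treat out-of-vocabulary input.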