Automatic speech recognition (ASR) systems have achieved remarkable performance under common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily from constrained model context windows and the sparsity of relevant information amid extensive contextual noise. To address this, we propose SAP$^{2}$, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, which efficiently compresses context embeddings while preserving speech-salient information. Experimental results demonstrate that SAP$^{2}$ achieves state-of-the-art performance on the SlideSpeech and LibriSpeech datasets, with word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces the biased word error rate (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
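The abstract does not detail the internals of the Speech-Driven Attention-based Pooling mechanism, so the following is only a minimal PyTorch sketch of one plausible reading: learned query slots first cross-attend to the speech encoder states, and the resulting speech-conditioned queries then pool a long, noisy list of context keyword embeddings into a fixed, compact set. All identifiers (SpeechDrivenAttentionPooling, query_slots, n_pooled) and the two-step cross-attention design are illustrative assumptions, not the authors' implementation.

```python
# A hedged sketch of speech-driven attention pooling; names and design are assumptions.
import torch
import torch.nn as nn

class SpeechDrivenAttentionPooling(nn.Module):
    """Compress context keyword embeddings into a small fixed set of vectors,
    with attention weights conditioned on the speech representation."""

    def __init__(self, d_model: int, n_heads: int = 4, n_pooled: int = 16):
        super().__init__()
        # Learned query slots, later conditioned on speech via cross-attention.
        self.query_slots = nn.Parameter(torch.randn(n_pooled, d_model) * 0.02)
        self.speech_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # speech:  (B, T_speech, d_model) acoustic encoder states
        # context: (B, N_keywords, d_model) candidate keyword embeddings
        B = speech.size(0)
        queries = self.query_slots.unsqueeze(0).expand(B, -1, -1)
        # Step 1: query slots absorb speech-salient information from the utterance.
        queries, _ = self.speech_attn(queries, speech, speech)
        # Step 2: the speech-conditioned queries pool the long keyword list
        # down to n_pooled vectors, pruning contextual noise.
        pooled, _ = self.context_attn(queries, context, context)
        return pooled  # (B, n_pooled, d_model) compressed context

# Usage with toy shapes: 500 candidate keywords compressed to 16 vectors.
pool = SpeechDrivenAttentionPooling(d_model=256)
speech = torch.randn(2, 120, 256)
context = torch.randn(2, 500, 256)
print(pool(speech, context).shape)  # torch.Size([2, 16, 256])
```

Keeping the pooled set at a fixed size would be one way to obtain the scalability the abstract reports, since downstream cost then no longer grows with the amount of contextual input; this rationale is likewise an inference, not a claim from the paper.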