Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, correcting ASR output requires conversation-level contextual evidence. Existing ASR correction methods typically either rely on the current hypothesis alone or concatenate the raw dialogue history, in which the sparse evidence relevant to a correction is hard to locate amid redundancy and noise. To address these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes the preceding interaction history into a dynamically updatable ontology memory, in which entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 of 10 paired backbone-setting combinations and encourages more selective, evidence-grounded corrections of context-dependent ASR errors.
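To make the idea of an ontology memory concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes each node holds a canonical term together with its surface variants, potential ASR confusions, and relations, and that retrieval is a plain case-sensitive substring match. The names `MemoryNode`, `OntologyMemory`, `update`, `retrieve`, and `correct` are hypothetical, and the string replacement in `correct` merely stands in for whatever model performs the correction.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One memory entry: a canonical term plus the evidence attached to it."""
    canonical: str
    kind: str = "entity"  # hypothetical label, e.g. "entity" or "term"
    variants: set = field(default_factory=set)     # surface variants seen so far
    confusions: set = field(default_factory=set)   # potential ASR confusions
    relations: dict = field(default_factory=dict)  # relation name -> related term


class OntologyMemory:
    """Dynamically updatable store of nodes, keyed by canonical term."""

    def __init__(self):
        self.nodes = {}

    def update(self, canonical, kind="entity", variants=(), confusions=(),
               relations=None):
        # Create the node on first sight, then merge in any new evidence.
        node = self.nodes.setdefault(canonical, MemoryNode(canonical, kind))
        node.variants.update(variants)
        node.confusions.update(confusions)
        node.relations.update(relations or {})
        return node

    def retrieve(self, hypothesis):
        # Assumption: retrieval is an exact, case-sensitive substring match
        # of stored confusion forms against the current ASR hypothesis.
        hits = []
        for node in self.nodes.values():
            for form in sorted(node.confusions):
                if form in hypothesis:
                    hits.append((form, node))
        return hits


def correct(hypothesis, memory):
    """Rewrite only spans backed by a retrieved node; leave the rest untouched."""
    corrected = hypothesis
    for form, node in memory.retrieve(hypothesis):
        corrected = corrected.replace(form, node.canonical)
    return corrected


if __name__ == "__main__":
    memory = OntologyMemory()
    # Evidence gathered from earlier turns of the conversation (made-up example).
    memory.update("Kubernetes", kind="term", confusions={"cooper netties"})
    print(correct("we deploy on cooper netties", memory))
```

Because `correct` changes only spans that match stored evidence, a hypothesis with no retrieved node is returned unchanged, which mirrors the selective, evidence-grounded behavior described above.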