End-to-end automatic speech recognition (ASR) systems have made significant progress in general scenarios. However, transcribing named entities (NEs) remains challenging in the contextual ASR scenario. Previous approaches address this by exploiting an NE dictionary, but they treat each entity as a sequence of individual tokens and generate it token by token, which can leave entities incompletely transcribed. In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. We design a systematic mechanism called CopyNE, which copies entities directly from the NE dictionary. Copying all tokens of an entity at once reduces errors during entity transcription and ensures that the entity is complete. Experiments demonstrate that CopyNE consistently improves entity transcription accuracy over previous approaches. Even when built on the strong Whisper model, CopyNE still achieves notable improvements.
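To make the copying idea concrete, the following is a minimal toy sketch, not the CopyNE implementation itself. It assumes that at each decoding step we are handed two hypothetical score tables, one over ordinary tokens and one over entries of the NE dictionary, and that greedy selection is used; the function name `decode_with_copy`, the entity key `ZHANG_WEI`, and all scores are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def decode_with_copy(
    steps: Sequence[Tuple[Dict[str, float], Dict[str, float]]],
    ne_dictionary: Dict[str, List[str]],
) -> List[str]:
    """Greedy decoding over ordinary tokens plus whole dictionary entities.

    Each step is a pair (token_scores, entity_scores). Both tables are
    hypothetical stand-ins for model outputs; token_scores is assumed
    to be non-empty, entity_scores may be empty.
    """
    output: List[str] = []
    for token_scores, entity_scores in steps:
        best_token = max(token_scores, key=token_scores.get)
        best_entity = (
            max(entity_scores, key=entity_scores.get) if entity_scores else None
        )
        if (
            best_entity is not None
            and entity_scores[best_entity] > token_scores[best_token]
        ):
            # Copy: emit every token of the entity in one step,
            # so the entity cannot be left half-transcribed.
            output.extend(ne_dictionary[best_entity])
        else:
            # Ordinary generation: emit a single token.
            output.append(best_token)
    return output


# Hypothetical dictionary and hand-written scores, for illustration only.
ne_dictionary = {"ZHANG_WEI": ["zhang", "wei"]}
steps = [
    ({"call": 0.9, "zhang": 0.1}, {"ZHANG_WEI": 0.05}),
    ({"zhang": 0.4, "chang": 0.35}, {"ZHANG_WEI": 0.6}),
    ({"now": 0.8, "wei": 0.1}, {"ZHANG_WEI": 0.1}),
]
print(decode_with_copy(steps, ne_dictionary))
```

In the second step the token-level scores for "zhang" and "chang" are close, which is where token-by-token generation could go wrong. Because the dictionary entry outscores both, the sketch copies the whole entity and emits its tokens together.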