For end-to-end Automatic Speech Recognition (ASR) models, recognizing personal or rare phrases can be difficult. A promising way to improve accuracy is through spelling correction (or rewriting) of the ASR lattice, where potentially misrecognized phrases are replaced with acoustically similar and contextually relevant alternatives. However, rewriting is challenging for ASR models trained with connectionist temporal classification (CTC), because their non-autoregressive, context-independent beam search produces noisy hypotheses. We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations and exploiting the richness of the CTC lattice. Our approach requires no retraining or modification of the ASR model. We achieve up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
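The core idea of phoneme-based rewriting can be illustrated with a minimal sketch. This is not the paper's FST implementation: the wordpiece-to-phoneme table, the entity list, and the function names below are hypothetical toy assumptions, and a single best hypothesis stands in for the full lattice. The sketch shows how a hypothesis whose phoneme sequence matches a contextual entity can be rewritten to that entity's surface form:

```python
# Illustrative sketch (NOT the paper's implementation): rewrite an ASR
# hypothesis by matching its phoneme sequence against contextual entities.
# All mappings below are toy examples for demonstration only.

# Hypothetical wordpiece -> phoneme mapping (the G2P step, applied
# directly to wordpieces rather than to reassembled words).
WP_TO_PHONEMES = {
    "_joh": ["JH", "AA"],
    "n": ["N"],
    "_jo": ["JH", "OW"],
    "an": ["AE", "N"],
}

# Hypothetical contextual entities, keyed by their phoneme sequence.
CONTEXT_ENTITIES = {
    ("JH", "AA", "N"): "John",
}

def phonemes(wordpieces):
    """Concatenate per-wordpiece phonemes, skipping unknown pieces."""
    out = []
    for wp in wordpieces:
        out.extend(WP_TO_PHONEMES.get(wp, []))
    return tuple(out)

def rewrite(wordpieces):
    """Return a contextual entity if the hypothesis matches it
    phonetically; otherwise join the wordpieces unchanged."""
    key = phonemes(wordpieces)
    if key in CONTEXT_ENTITIES:
        return CONTEXT_ENTITIES[key]
    return "".join(wp.lstrip("_") for wp in wordpieces)

print(rewrite(["_joh", "n"]))  # matches the entity "John"
print(rewrite(["_jo", "an"]))  # no phonetic match; left as-is
```

In the actual method, this matching is performed by FST composition over the entire wordpiece lattice rather than over a single hypothesis, so acoustically similar paths can be rescored without retraining the ASR model.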