This paper presents a novel optimization framework for automatic speech recognition (ASR) that aims to reduce hallucinations produced by an ASR model. The framework trains the ASR model to maximize the expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the score is computed by a separately trained estimator. Experiments on the AMI meeting corpus and the VoxPopuli corpus show that an ASR model trained with the proposed framework generates hypotheses with significantly higher consistency scores against the ground-truth transcriptions, while keeping word error rates close to those of ASR models trained with cross-entropy. Furthermore, training ASR models with the proposed framework improves speech summarization quality, as measured by the factual consistency of meeting conversation summaries generated by a large language model.
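As an illustration of the objective described in the abstract, the sketch below computes an expected factual consistency score over a small set of hypotheses. It assumes the expectation is approximated over an N-best list with hypothesis posteriors renormalized by a softmax, as is common in minimum word error rate training; the abstract does not state how the expectation is actually computed. The function names `expected_consistency` and `consistency_loss` and the example scores are hypothetical, and in practice the scores would come from the separately trained estimator.

```python
import math


def expected_consistency(log_probs, scores):
    """Expected consistency score over an N-best list.

    log_probs: ASR log-probabilities of each hypothesis (assumed input).
    scores: factual consistency score of each hypothesis against the
            ground-truth transcription, from an external estimator (assumed).
    """
    if len(log_probs) != len(scores) or not log_probs:
        raise ValueError("log_probs and scores must be non-empty and equal length")
    # Softmax over the N-best list, shifted by the max for numerical stability.
    max_lp = max(log_probs)
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # Posterior-weighted average of the consistency scores.
    return sum(p * s for p, s in zip(posteriors, scores))


def consistency_loss(log_probs, scores):
    """Loss to minimize: the negative expected consistency score."""
    return -expected_consistency(log_probs, scores)


if __name__ == "__main__":
    # Two hypotheses; the first is three times as likely and fully consistent.
    nbest_log_probs = [math.log(3.0), 0.0]
    nbest_scores = [1.0, 0.0]
    print(consistency_loss(nbest_log_probs, nbest_scores))
```

Minimizing this loss shifts probability mass toward hypotheses the estimator judges more consistent with the reference. In a real training setup the same computation would be written in an automatic differentiation framework so that gradients flow into the ASR model through the hypothesis log-probabilities.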