ASR error correction (EC) techniques, which improve Automatic Speech Recognition (ASR) outputs in a post-processing step, have been widely developed because they make efficient use of parallel text data. Previous work mainly relies on text and/or speech data, which limits the performance gain in cases where other modalities, such as visual information, are also critical for EC. The challenges are twofold. First, previous work has not emphasized visual information, so it remains largely unexplored. Second, the community lacks a high-quality benchmark in which visual information matters for EC models. Therefore, this paper provides 1) simple yet effective methods, namely gated fusion and image captions as prompts, to incorporate visual information into EC; and 2) a large-scale benchmark dataset, Visual-ASR-EC, in which each training item consists of visual, speech, and text information, and the test data are carefully selected by human annotators to ensure that even humans could make mistakes when visual information is missing. Experimental results show that using captions as prompts makes effective use of the visual information and surpasses state-of-the-art methods by up to 1.2% in Word Error Rate (WER), which also indicates that visual information is critical in the proposed Visual-ASR-EC dataset.
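Since the reported gain is stated in terms of WER, a minimal sketch of the standard metric may help: WER is the word-level edit distance (substitutions, deletions, and insertions) between the ASR or EC output and the reference transcript, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; the abstract does not specify the scoring script or text normalization actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by plain whitespace splitting, with no
    case folding or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
reference = "he caught a large bass"
asr_output = "he caught a large base"
print(word_error_rate(reference, asr_output))
```

An EC model is scored by computing this quantity on its corrected output and comparing it with the WER of the raw ASR hypothesis; a lower value after correction means the model removed more errors than it introduced.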