We present the first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR), with a focus on diverse rare and out-of-vocabulary (OOV) phrases such as proper names or terms. The proposed approach makes it possible to create millions of realistic examples of corrupted ASR hypotheses and to simulate non-trivial biasing lists for the customization task. Furthermore, we propose injecting two types of "hard negatives" into the simulated biasing lists in training examples and describe our procedures for mining them automatically. We report experiments in which an open-source customization model is trained on the proposed dataset and show that injecting hard negative biasing phrases decreases both the word error rate (WER) and the number of false alarms.
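To make the idea of a simulated biasing list with hard negatives concrete, the following is a minimal sketch and not the procedure used in this work. The abstract does not specify the two types of hard negatives, so the sketch assumes one plausible kind: phrases that are orthographically close to the target phrase, scored with Python's `difflib.SequenceMatcher`. The function names `mine_hard_negatives` and `build_biasing_list`, the similarity threshold, and the list size are all hypothetical choices made for illustration.

```python
import difflib
import random


def mine_hard_negatives(target, vocabulary, k=2, min_ratio=0.6):
    """Return up to k phrases that resemble `target` but differ from it.

    Assumption: similarity is character-level (difflib ratio); the paper's
    actual mining criteria are not given in the abstract.
    """
    scored = []
    for phrase in vocabulary:
        if phrase == target:
            continue
        ratio = difflib.SequenceMatcher(None, target, phrase).ratio()
        if ratio >= min_ratio:
            scored.append((ratio, phrase))
    # Most similar first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:k]]


def build_biasing_list(target, vocabulary, size=5, k_hard=2, seed=0):
    """Simulate a biasing list: target + hard negatives + random distractors."""
    rng = random.Random(seed)
    hard = mine_hard_negatives(target, vocabulary, k=k_hard)
    pool = [p for p in vocabulary if p != target and p not in hard]
    n_random = max(0, min(len(pool), size - 1 - len(hard)))
    distractors = rng.sample(pool, n_random)
    biasing = [target] + hard + distractors
    rng.shuffle(biasing)  # hide the position of the correct phrase
    return biasing, hard


if __name__ == "__main__":
    vocab = ["jon snow", "john snow", "joan snow", "tensorflow",
             "kubernetes", "paris", "berlin"]
    biasing, hard = build_biasing_list("john snow", vocab)
    print("hard negatives:", hard)
    print("biasing list:", biasing)
```

In a training example built this way, the model sees near-miss candidates alongside the correct phrase and must learn to reject them, which is the intuition behind the reported reduction in false alarms.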