Automatic speech recognition systems have been shown to underperform when transcribing words that are rare in the training data, in particular specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can handle glossaries of only a few hundred terms before becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than that of a comparable baseline, allowing users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.