Language models are useful adjuncts to optical models for producing accurate optical character recognition (OCR) results. One factor that limits the power of language models in this context is the existence of many specialized domains whose language statistics are very different from those implied by a general language model, such as checks, medical prescriptions, and many other specialized document classes. This paper introduces an algorithm for efficiently generating a domain-specific, word-based language model at run time and attaching it to the general language model in an OCR system. To make the best use of this model, the paper also introduces a modified connectionist temporal classification (CTC) beam search decoder that allows hypotheses to remain in contention based on the possible future completion of vocabulary words. The result is a substantial reduction in word error rate when recognizing material from specialized domains.
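The idea of letting a hypothesis stay in contention because its trailing partial word could still complete to a vocabulary word can be illustrated with a small sketch. This is not the paper's decoder: the trie, the additive bonus values, and the names `VocabTrie`, `partial_word_bonus`, and `prune` are all assumptions chosen for illustration, and a real CTC beam search would also track blank and non-blank path probabilities per prefix.

```python
class _Node:
    """One trie node: child links plus an end-of-word flag."""
    def __init__(self):
        self.children = {}
        self.is_word = False


class VocabTrie:
    """Prefix tree over a domain vocabulary (hypothetical helper)."""
    def __init__(self, words):
        self.root = _Node()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, _Node())
            node.is_word = True

    def lookup(self, text):
        """Return (is_prefix, is_word) for the given string."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return False, False
        return True, node.is_word


def partial_word_bonus(trie, text, prefix_bonus=1.0, word_bonus=2.0):
    """Assumed scoring rule: reward a trailing word that is, or could
    still become, a vocabulary word. The bonus values are arbitrary."""
    partial = text.split(" ")[-1]
    if not partial:
        return 0.0
    is_prefix, is_word = trie.lookup(partial)
    if is_word:
        return word_bonus
    if is_prefix:
        return prefix_bonus
    return 0.0


def prune(hypotheses, trie, beam_width):
    """Keep the top `beam_width` (text, log_prob) pairs, ranked by
    log probability plus the vocabulary bonus."""
    ranked = sorted(
        hypotheses,
        key=lambda h: h[1] + partial_word_bonus(trie, h[0]),
        reverse=True,
    )
    return ranked[:beam_width]


# Usage: "take amoxi" has the lowest raw score, but it is a prefix of a
# vocabulary word, so the bonus keeps it in the beam.
trie = VocabTrie(["amoxicillin", "ibuprofen"])
beam = [("take amoxi", -5.0), ("take arnoxi", -4.6), ("take the", -4.8)]
kept = prune(beam, trie, beam_width=2)
```

Without the bonus, pruning on raw log probability alone would discard "take amoxi" before the optical model had a chance to emit the remaining characters of "amoxicillin".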