State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured, handwritten documents remain a challenge. Adapting existing models to domain-specific training data is expensive for two reasons: 1) domain-specific documents (such as handwritten prescriptions and lab notes) are scarce, and 2) annotation is harder still, because decoding hard-to-read handwritten document images requires domain-specific knowledge. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consist of images, each paired with the list of medicine names it contains but not their locations in the image. We solve the problem by first identifying the regions of interest, i.e., medicine lines, from just the weak labels and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs more than 2.5x better at extracting medicine names from prescriptions.
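To make the weak-label setting concrete, the following is a minimal sketch of how image-level medicine names might be aligned to candidate text lines. It is not the method of this work: the function names (`similarity`, `find_medicine_lines`), the sample OCR lines, the 0.6 threshold, and the use of fuzzy string matching from Python's standard `difflib` module are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_medicine_lines(ocr_lines, weak_labels, threshold=0.6):
    """Assign each weak label to the line holding its best-matching token.

    Hypothetical helper: returns {label: (line_index, score)} and omits
    labels whose best score falls below `threshold`.
    """
    matches = {}
    for label in weak_labels:
        best_idx, best_score = -1, 0.0
        for idx, line in enumerate(ocr_lines):
            for token in line.split():
                score = similarity(token, label)
                if score > best_score:
                    best_idx, best_score = idx, score
        if best_score >= threshold:
            matches[label] = (best_idx, best_score)
    return matches


# Made-up noisy OCR output for one prescription image (assumption).
ocr_lines = [
    "Dr. A. Sharma  12/03",
    "Tab Paracetmol 500mg 1-0-1",
    "Cap Amoxicilin 250 mg x 5 days",
    "Review after one week",
]
# Weak labels: medicine names known to appear in the image, without locations.
weak_labels = ["Paracetamol", "Amoxicillin"]

matches = find_medicine_lines(ocr_lines, weak_labels)
for label, (idx, score) in matches.items():
    print(f"{label!r} -> line {idx} (score {score:.2f})")
```

The sketch compares single tokens only, so multi-word medicine names would need a sliding window over adjacent tokens; it also assumes line segmentation and a first-pass OCR transcript are already available.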