Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models focus only on cropped text instance images, which requires bounding box information from a text region detection model. Introducing an additional detector to identify the text instance images in advance adds complexity; however, instance-level OCR models have very low accuracy when applied to the whole image for document-level OCR, for example on receipt images containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model that transcribes all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved a 64.4 F1-score and a 22.8% character error rate (CER), outperforming the baseline results of 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives an 87.8 F1-score and a 4.98% CER with minimal additional pre- or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in reading order, which is practical for real-world applications.
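
To make the reported metric and the chunking step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes CER is the character-level Levenshtein distance divided by the reference length, and that "equally sized chunks" means horizontal strips of near-equal height; the abstract states neither detail, and the function names `edit_distance`, `cer`, and `chunk_bounds` are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction (assumed definition: edits / len(ref))."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def chunk_bounds(height: int, n_chunks: int) -> list[tuple[int, int]]:
    """(top, bottom) pixel rows for n_chunks near-equal horizontal strips.

    Assumption: chunks are full-width strips stacked top to bottom.
    """
    if n_chunks <= 0 or height < n_chunks:
        raise ValueError("need 1 <= n_chunks <= height")
    edges = [k * height // n_chunks for k in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # One wrong character out of eleven in the reference.
    print(f"CER: {cer('TOTAL 12.50', 'TOTAL 12.80'):.2%}")
    # Strip boundaries for a hypothetical 1500-pixel-tall receipt scan.
    print(chunk_bounds(1500, 15)[:3])
```

Under these assumptions, each strip would be transcribed separately and the per-strip outputs concatenated top to bottom before scoring against the full-page reference text.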
