In this paper, we propose a novel approach to printed Urdu text recognition based on high-resolution, multi-scale semantic feature extraction. The proposed UTRNet architecture, a hybrid CNN-RNN model, demonstrates state-of-the-art performance on benchmark datasets. Previous works struggle to generalize to the intricacies of the Urdu script and are limited by the lack of sufficient annotated real-world data. To address these limitations, we introduce UTRSet-Real, a large-scale annotated real-world dataset comprising over 11,000 lines, and UTRSet-Synth, a synthetic dataset of 20,000 lines that closely resemble real-world data. We also correct the ground truth of the existing IIITH dataset, making it a more reliable resource for future research, and provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents. Additionally, we have developed an online tool for end-to-end Urdu OCR from printed documents by integrating UTRNet with a text detection model. Our work not only addresses current limitations of Urdu OCR but also paves the way for future research in this area. The project page with source code, datasets, annotations, trained models, and online tool is available at abdur75648.github.io/UTRNet.