In this paper, we propose a novel approach to printed Urdu text recognition based on high-resolution, multi-scale semantic feature extraction. Our proposed UTRNet architecture, a hybrid CNN-RNN model, demonstrates state-of-the-art performance on benchmark datasets. Previous works struggle to generalize to the intricacies of the Urdu script and are limited by the lack of sufficient annotated real-world data. To address these limitations, we introduce UTRSet-Real, a large-scale annotated real-world dataset comprising over 11,000 lines, and UTRSet-Synth, a synthetic dataset of 20,000 lines that closely resemble real-world text. We also correct the ground truth of the existing IIITH dataset, making it a more reliable resource for future research. In addition, we provide UrduDoc, a benchmark dataset for Urdu text line detection in scanned documents, and we have developed an online tool for end-to-end Urdu OCR from printed documents by integrating UTRNet with a text detection model. Our work not only addresses the current limitations of Urdu OCR but also paves the way for future research in this area. The project page with source code, datasets, annotations, trained models, and online tool is available at abdur75648.github.io/UTRNet.