This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) for Pashto, a low-resource language. Natural Language Processing (NLP) in Pashto is hampered by the cursive nature of its script and by the scarcity of structured datasets. To address this, we developed PsOCR, a synthetic Pashto OCR dataset of one million images annotated with bounding boxes at the word, line, and document levels. The dataset is suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results show that Gemini achieves the best performance among all models, while Qwen-7B performs best among the open-source models. This work assesses the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research, not only in Pashto OCR but also in OCR for similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.