SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier

Optical Character Recognition (OCR) for low-resource languages remains a significant challenge because large-scale annotated training datasets are scarce. Kashmiri, for example, has approximately 7 million speakers and a complex Perso-Arabic script with unique diacritical marks, yet it currently lacks support in major OCR systems, including Tesseract, TrOCR, and PaddleOCR. Creating datasets for such languages by hand is prohibitively expensive, time-consuming, and error-prone, and often requires word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator designed for low-resource languages. The tool addresses this fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a pipeline comprising text segmentation (at the character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with a configurable font distribution, and more than 25 data augmentation techniques that simulate real-world document degradations, including rotation, blur, noise, and scanner artifacts. We demonstrate the approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available to researchers and practitioners working with underserved writing systems worldwide.
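
As a rough illustration of the text-side stages named above (normalization, script purity filtering, word and n-gram segmentation, and sampling from a font distribution), a minimal Python sketch might look as follows. It is not the SynthOCR-Gen implementation: the function names, the choice of NFC normalization, the Unicode ranges treated as Perso-Arabic, and the font names are all assumptions made for this example, and rendering and augmentation are left out.

```python
import random
import unicodedata

# Assumption: code point ranges treated as "Perso-Arabic" for the purity check
# (Arabic, Arabic Supplement, Arabic Extended-A). The real tool may differ.
ARABIC_RANGES = [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF)]
# Assumption: zero-width (non-)joiners are allowed inside words.
ALLOWED_EXTRA = {0x200C, 0x200D}


def is_script_char(ch):
    """Return True if ch falls inside the assumed script ranges."""
    cp = ord(ch)
    return cp in ALLOWED_EXTRA or any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def normalize(text):
    """Apply Unicode NFC normalization (an assumed choice of form)."""
    return unicodedata.normalize("NFC", text)


def segment_words(text):
    """Split on whitespace and keep only words made purely of script characters."""
    words = normalize(text).split()
    return [w for w in words if all(is_script_char(ch) for ch in w)]


def ngrams(words, n):
    """Build space-joined word n-grams from a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def assign_fonts(samples, font_weights, seed=0):
    """Pair each sample with a font drawn from a weighted distribution."""
    rng = random.Random(seed)
    names = list(font_weights)
    weights = [font_weights[name] for name in names]
    return [(s, rng.choices(names, weights=weights, k=1)[0]) for s in samples]


if __name__ == "__main__":
    corpus = "کٲشُر زَبان hello کتاب"
    words = segment_words(corpus)  # the Latin-script token is filtered out
    # Hypothetical font names and weights, for illustration only.
    fonts = {"FontA": 0.7, "FontB": 0.3}
    for sample, font in assign_fonts(words + ngrams(words, 2), fonts):
        print(font, sample)
```

Each `(sample, font)` pair would then be passed to a renderer and an augmentation stage to produce an image with its ground-truth label. Filtering at the word level, rather than stripping foreign characters out of words, keeps every label an exact match for the text that would be rendered.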
