Although Devanagari is used by hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer-level diversity, which restricts their utility for modern data-driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer-specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset includes non-identifiable demographic metadata and page-level layout difficulty annotations that facilitate stratified benchmarking, and it is curated for quality using objective sharpness and resolution criteria. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset's reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.
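Evaluating generalization to unseen writers, as described above, requires that no writer contributes to both the training and the test data. The following is a minimal sketch of such a writer-disjoint split, not the authors' protocol. The `(writer_id, doha_id)` sample index, the function name `writer_disjoint_split`, the 20% test fraction, and the assumption of exactly one page per writer per doha are all hypothetical and chosen only for illustration.

```python
import random


def writer_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (writer_id, doha_id) samples so that no writer is in both sets.

    Hypothetical helper: the split is made over writers, not over pages,
    so every page from a held-out writer lands in the test set.
    """
    writers = sorted({writer_id for writer_id, _ in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(writers)
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])
    train = [s for s in samples if s[0] not in test_writers]
    test = [s for s in samples if s[0] in test_writers]
    return train, test


# Assumed index: 531 writers, each transcribing the same 6 dohas once.
samples = [(writer, doha) for writer in range(531) for doha in range(6)]
train, test = writer_disjoint_split(samples)
```

Because every writer transcribes the same six dohas, the lexical content of the two sets is identical under this split, so any gap between training and test performance reflects writer variation and not vocabulary.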