DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

Although Hindi has hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer-level diversity, which restricts their utility for modern data-driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, in which characters are connected through a shared shirorekha (horizontal headline) and form rich ligatures. We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus: all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer-specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non-identifiable demographic metadata, quality curation based on objective sharpness and resolution criteria, and page-level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments show clear quality separation and strong generalization to unseen writers, supporting the dataset's reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.
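Generalization to unseen writers is usually measured with a writer-disjoint split, in which no contributor appears in both the training and the test set. The following is a minimal sketch of such a split, not the authors' protocol: the function name `writer_disjoint_split`, the 20% test fraction, the file layout, and the assumption of one page per doha per writer are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def writer_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (writer_id, page_path) pairs so that no writer is in both sets."""
    # Group pages by the writer who produced them.
    by_writer = defaultdict(list)
    for writer_id, page in samples:
        by_writer[writer_id].append(page)

    # Shuffle writers (not pages) with a fixed seed for reproducibility.
    writers = sorted(by_writer)
    random.Random(seed).shuffle(writers)

    # Hold out a fraction of the writers, at least one.
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers, train_writers = writers[:n_test], writers[n_test:]

    train = [(w, p) for w in train_writers for p in by_writer[w]]
    test = [(w, p) for w in test_writers for p in by_writer[w]]
    return train, test


# Hypothetical layout: 531 writers, one page per doha (an assumption,
# the abstract does not state how dohas map to pages).
samples = [
    (f"writer_{w:03d}", f"writer_{w:03d}/doha_{d}.png")
    for w in range(531)
    for d in range(1, 7)
]
train, test = writer_disjoint_split(samples)
```

Because every writer transcribes the same six dohas, splitting by writer keeps the textual content identical across the two sets, so any gap between training and test performance reflects handwriting style rather than vocabulary.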

