Progress in automated handwriting recognition has been hampered by the lack of large training datasets. Nearly all research relies on a handful of small datasets, on which models often overfit. We present CENSUS-HWR, a new dataset of full English handwritten words in 1,812,014 grayscale images. The collection contains a total of 1,865,134 handwritten texts drawn from a vocabulary of 10,711 English words. It is intended to serve as a benchmark for deep learning handwriting recognition models. The dataset was extracted from the 1930 and 1940 US censuses, each taken by approximately 70,000 enumerators. The dataset and the trained model with its weights are freely available for download at https://censustree.org/data.html.