Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification), a framework for detecting and cleaning label errors. CER-HV combines two components: a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) trained with early stopping to avoid overfitting to noisy samples, and a human-in-the-loop (HITL) step that verifies the highest-ranked samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These errors were identified with up to 90 percent precision in the Muharaf dataset and 80-86 percent precision in the PHTI dataset. We also show that our CRNN achieves state-of-the-art performance on five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and by 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in Arabic-script languages, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.
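The ranking step of such a framework can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function names, the `(sample_id, label, prediction)` tuple layout, and the `top_k` cutoff are all assumptions made for illustration. It computes a per-sample CER as the Levenshtein distance between the dataset label and a recognizer's prediction, normalized by label length, and returns the highest-CER samples as candidates for human verification.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(label: str, prediction: str) -> float:
    """Character error rate of a prediction against its label."""
    if not label:
        # Assumption: an empty label with a non-empty prediction counts as fully wrong.
        return 0.0 if not prediction else 1.0
    return edit_distance(label, prediction) / len(label)


def rank_by_cer(samples: List[Tuple[str, str, str]], top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k (sample_id, cer) pairs, highest CER first.

    `samples` holds hypothetical (sample_id, label, prediction) tuples.
    """
    scored = [(sample_id, cer(label, pred)) for sample_id, label, pred in samples]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy data for illustration only; real inputs would be line images and model outputs.
    samples = [
        ("a", "hello", "hello"),
        ("b", "hello", "xyz"),
        ("c", "hello", "hallo"),
    ]
    for sample_id, score in rank_by_cer(samples, top_k=2):
        print(sample_id, round(score, 2))
```

A high CER does not by itself mean the label is wrong, since the recognizer may simply have failed on a hard sample, which is why the ranked candidates are passed to a human reviewer rather than removed automatically.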