Despite recent advances, Handwritten Text Recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR. Part of the problem is dataset quality. To help close this gap, we propose a two-stage framework (CER-HV) for detecting label errors. Stage 1 (CER) is a Character-Error-Rate-based noise detector built on a Convolutional Recurrent Neural Network (CRNN) architecture. Stage 2 (HV) is Human-in-the-Loop (HITL) verification of the noisy samples flagged by the first stage. Applied to multiple Arabic-script datasets, CER-HV identifies samples with label errors, including transcription, segmentation, orientation, and non-text content errors, that can markedly affect HTR performance. The first stage of the framework identifies these errors with up to 90 percent top-50 precision. We also show that our CRNN achieves state-of-the-art performance on five of the six evaluated datasets, reaching 8.46 percent Character Error Rate (CER) on KHATT (Arabic), 8.22 percent on PHTI (Pashto), 10.59 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We also establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. After dataset cleaning and retraining, CER-HV improves evaluation CER by up to 1.8 percentage points. Although our experiments focus on documents written in Arabic-script languages, the framework is general and can be applied to other text recognition datasets.
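As a rough illustration of the Stage 1 idea, the sketch below scores each sample by the CER between its dataset label and a recognizer's prediction, then ranks samples so that the highest-CER ones go to human verification first. It is a minimal sketch under stated assumptions, not the authors' implementation: the predictions are assumed to come from an already trained recognizer, and the function names (`edit_distance`, `cer`, `rank_suspects`, `precision_at_k`), the per-sample ranking rule, and the default cutoff `k=50` are hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(label: str, prediction: str) -> float:
    """Character Error Rate of a prediction, measured against the label."""
    if not label:
        # Assumption: an empty label with a non-empty prediction counts as 1.0.
        return 0.0 if not prediction else 1.0
    return edit_distance(label, prediction) / len(label)


def rank_suspects(samples: Iterable[Tuple[str, str, str]],
                  k: int = 50) -> List[Tuple[str, float]]:
    """Return the k samples with the highest CER as (sample_id, cer) pairs.

    Each input item is (sample_id, label, prediction). A high CER may mean
    the label is wrong, or simply that the sample is hard, which is why a
    human verification step follows.
    """
    scored = [(sid, cer(label, pred)) for sid, label, pred in samples]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def precision_at_k(ranked_ids: List[str], confirmed_errors: Set[str]) -> float:
    """Fraction of flagged samples that a human confirmed as label errors."""
    if not ranked_ids:
        return 0.0
    hits = sum(1 for sid in ranked_ids if sid in confirmed_errors)
    return hits / len(ranked_ids)
```

In this sketch, the list returned by `rank_suspects` would be handed to annotators, and `precision_at_k` would then be computed from the sample IDs they confirm as genuine label errors.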