CER-HV: A CER-Based Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR

Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These errors are identified with up to 90 percent precision on the Muharaf dataset and 80-86 percent precision on the PHTI dataset. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in Arabic-script languages, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.
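
The ranking idea behind CER-HV can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`edit_distance`, `cer`, `rank_for_review`), the sample tuple layout, and the `top_k` cutoff are assumptions made for illustration, and the recognizer's predictions are taken as given rather than produced by a CRNN. The sketch computes a per-sample CER as the Levenshtein distance between label and prediction divided by the label length, then sorts samples so that the highest-CER ones, which are the most likely to carry label errors, reach the human reviewer first.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by label length."""
    if not ref:
        # Assumption: an empty label with a non-empty prediction counts as 1.0.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rank_for_review(samples: List[Tuple[str, str, str]],
                    top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k (sample_id, cer) pairs, highest CER first.

    Each sample is a hypothetical (sample_id, label, prediction) tuple.
    """
    scored = [(sid, cer(label, pred)) for sid, label, pred in samples]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    demo = [
        ("a", "hello", "hello"),  # prediction agrees with label
        ("b", "hello", "xyz"),    # strong disagreement, likely label issue
        ("c", "hello", "hallo"),  # minor recognition error
    ]
    for sample_id, score in rank_for_review(demo, top_k=2):
        print(sample_id, round(score, 2))
```

In a full pipeline, a reviewer would inspect the returned samples and decide whether each is a transcription, segmentation, orientation, or non-text error before correcting or discarding it; the cutoff for how many samples to review is a budget choice rather than something fixed by this sketch.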

