A Human-in-the-Loop Label Error Detection Framework Applied to Arabic-Script HTR Datasets

Despite recent advances, Handwritten Text Recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR. Part of the problem is dataset quality. To help close this gap, we propose a two-stage framework (CER-HV) for detecting label errors. Stage 1 (CER) is a Character-Error-Rate-based noise detector built on a Convolutional Recurrent Neural Network (CRNN) architecture. Stage 2 (HV) is Human-in-the-Loop (HITL) verification of the noisy samples detected by the first stage. Applied to multiple Arabic-script datasets, the CER-HV framework identifies samples with label errors, including transcription, segmentation, orientation, and non-text content errors, that can markedly affect HTR performance. The first stage of the framework identified these errors with up to 90 percent precision among its top 50 candidates. We also show that our CRNN achieves state-of-the-art performance on five of the six evaluated datasets, reaching 8.46 percent Character Error Rate (CER) on KHATT (Arabic), 8.22 percent on PHTI (Pashto), 10.59 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Cleaning the datasets with CER-HV and retraining improves evaluation CER by up to 1.8 percentage points. Although our experiments focus on documents written in Arabic-script languages, the framework is general and can be applied to other text recognition datasets.
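
To make the first stage concrete, the following is a minimal sketch of CER-based ranking, not the authors' implementation. It assumes each sample already has a predicted transcription from some recognizer (the CRNN itself is not shown), and the names `levenshtein`, `cer`, `flag_suspects`, and `precision_at_k` are hypothetical helpers introduced only for illustration. Samples whose prediction disagrees most with the label are ranked first and would be passed to a human for verification.

```python
from typing import Iterable, List, Set, Tuple


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by the label length."""
    if not ref:
        # Assumed convention for empty labels: 0 if the prediction is also empty.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def flag_suspects(samples: Iterable[Tuple[str, str, str]], k: int = 50) -> List[str]:
    """Return the ids of the k samples with the highest CER.

    Each sample is (sample_id, label, prediction). A high CER means the
    recognizer disagrees strongly with the label, so the label is a
    candidate for human verification.
    """
    scored = [(cer(label, pred), sid) for sid, label, pred in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [sid for _, sid in scored[:k]]


def precision_at_k(flagged: List[str], confirmed_errors: Set[str], k: int) -> float:
    """Fraction of the top-k flagged samples that a human confirmed as label errors."""
    top = flagged[:k]
    if not top:
        return 0.0
    return sum(1 for sid in top if sid in confirmed_errors) / len(top)


if __name__ == "__main__":
    # Toy data: (id, label, prediction). Values are illustrative only.
    samples = [("a", "hello", "hello"), ("b", "world", "wxrld"), ("c", "text", "zzzz")]
    suspects = flag_suspects(samples, k=2)
    print(suspects)
    print(precision_at_k(suspects, {"c"}, k=2))
```

In this sketch the second stage is represented only by the `confirmed_errors` set, which stands in for the reviewer's decisions; a top-50 precision figure such as the one reported above would correspond to calling `precision_at_k` with `k=50`.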

