The Character Error Vector: Decomposable errors for page-level OCR evaluation

The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, which limits its use as a metric and makes page-level OCR evaluation challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the document understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate two: SpACER (Spatially Aware Character Error Rate) and a character distribution method based on the Jensen-Shannon Distance. We validate the CEV's performance against other metrics: first, its relationship with CER; then, with parse quality; and finally, as a direct measure of page-level OCR quality. The validation shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers consisting of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support document understanding research.
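To illustrate the bag-of-characters idea behind the character distribution method, the following is a minimal sketch of a Jensen-Shannon distance between the character frequencies of a reference page and an OCR output. It is not the CEV itself, nor the API of the accompanying library: the function names `char_distribution` and `js_distance` are hypothetical, the base-2 logarithm is an assumption chosen so that the distance lies in [0, 1], and the decomposition into parsing, OCR, and interaction components is not shown.

```python
import math
from collections import Counter


def char_distribution(text, alphabet):
    """Relative frequency of each character of `alphabet` in `text`."""
    counts = Counter(text)
    total = sum(counts[c] for c in alphabet)
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[c] / total for c in alphabet]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_distance(reference, hypothesis):
    """Jensen-Shannon distance between two bags of characters.

    Hypothetical helper for illustration only. Returns a value in [0, 1]:
    0 for identical character distributions, 1 for disjoint ones.
    """
    # Assumed edge-case convention: two empty texts match, one empty text does not.
    if not reference and not hypothesis:
        return 0.0
    if not reference or not hypothesis:
        return 1.0

    alphabet = sorted(set(reference) | set(hypothesis))
    p = char_distribution(reference, alphabet)
    q = char_distribution(hypothesis, alphabet)
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


if __name__ == "__main__":
    print(js_distance("hello world", "world hello"))  # same characters, different order
    print(js_distance("hello world", "he1lo w0rld"))  # character substitutions
```

Because only character counts are compared, the score is unchanged when text blocks are read in a different order, which is the property that keeps a bag-of-characters evaluator defined when page parsing goes wrong. The price is that it cannot tell where on the page an error occurred, which is the gap a spatially aware variant would address.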
