Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging because of degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose VLM (Qwen) on line-level historical English texts, using length-weighted accuracy metrics and hypothesis-driven error analysis. Qwen achieves lower CER and WER and is more robust to degraded input, but it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
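To make the metrics named in the abstract concrete, the following is a minimal sketch of corpus-level CER and WER. It assumes that "length-weighted" means summing edit distances over all lines and dividing by the total reference length, so that each line contributes in proportion to its length; the abstract does not give the exact definition, so this reading is an assumption. The function names (`edit_distance`, `length_weighted_error_rate`, `cer`, `wer`) and the sample lines are illustrative and are not taken from the paper.

```python
from typing import Callable, Iterable, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_weighted_error_rate(
    pairs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], Sequence],
) -> float:
    """Total edits divided by total reference length over all lines.

    Assumption: this micro-average is one plausible reading of
    "length-weighted"; the paper's exact formula may differ.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_len += len(ref_tokens)
    if total_len == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_len


def cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Character Error Rate: tokens are individual characters."""
    return length_weighted_error_rate(pairs, list)


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Word Error Rate: tokens are whitespace-separated words."""
    return length_weighted_error_rate(pairs, str.split)


# Hypothetical (reference, OCR output) line pairs in which the output
# modernizes historical spellings.
pairs = [
    ("the publick good", "the public good"),
    ("musick", "music"),
]
print(f"CER = {cer(pairs):.4f}")
print(f"WER = {wer(pairs):.4f}")
```

In this toy input, each modernized spelling costs only one character edit, so the CER stays low even though every change alters a historically meaningful form. This is the kind of error that an aggregate score alone does not expose.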