Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
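To make the span-level record concrete, the following is a minimal sketch of how a correction with edit type, correction source, confidence, and revision status might be represented, and how a provenance filter could select which corrections are applied before downstream extraction. It is not the authors' schema: the class `SpanCorrection`, the function `apply_corrections`, the field names, the status vocabulary ("accepted", "proposed"), the 0.9 threshold, and the sample text are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SpanCorrection:
    """One span-level correction with its provenance (hypothetical schema)."""
    start: int          # inclusive character offset into the raw OCR text
    end: int            # exclusive character offset into the raw OCR text
    replacement: str    # corrected text for the span
    edit_type: str      # e.g. "substitution", "insertion", "deletion"
    source: str         # e.g. "model", "human"
    confidence: float   # assumed to lie in [0, 1]
    status: str         # e.g. "proposed", "accepted", "rejected"


def apply_corrections(
    raw: str,
    corrections: Iterable[SpanCorrection],
    keep: Callable[[SpanCorrection], bool] = lambda c: True,
) -> str:
    """Apply only the corrections selected by `keep`, leaving raw OCR elsewhere."""
    selected = sorted((c for c in corrections if keep(c)), key=lambda c: c.start)
    pieces = []
    cursor = 0
    for c in selected:
        if c.start < cursor:
            raise ValueError("overlapping corrections are not supported in this sketch")
        pieces.append(raw[cursor:c.start])  # untouched raw text before the span
        pieces.append(c.replacement)        # corrected span
        cursor = c.end
    pieces.append(raw[cursor:])             # untouched tail
    return "".join(pieces)


# Illustrative input only; not drawn from the pilot corpus.
raw_ocr = "Tbe city of Lcndon in 1851"
corrections = [
    SpanCorrection(0, 3, "The", "substitution", "human", 0.98, "accepted"),
    SpanCorrection(12, 18, "London", "substitution", "model", 0.55, "proposed"),
]


def is_trusted(c: SpanCorrection) -> bool:
    """Assumed filter: accepted corrections with high confidence."""
    return c.status == "accepted" and c.confidence >= 0.9


fully_corrected = apply_corrections(raw_ocr, corrections)
provenance_filtered = apply_corrections(raw_ocr, corrections, keep=is_trusted)

# Corrections excluded by the filter are candidates for human review.
needs_review = [c for c in corrections if not is_trusted(c)]
```

Under these assumptions, the three text variants compared in the study correspond to `raw_ocr`, `fully_corrected`, and `provenance_filtered`. Because the raw text is never overwritten, each variant can be regenerated from the same correction records, and `needs_review` gives one simple way to surface spans whose corrections are uncertain.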