The digitisation of historical documents has traditionally been treated as character-level transcription, producing flat text that lacks the structural and semantic information needed for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages (Preprocessing, Extraction, Refinement, and Enrichment) and employs a schema-driven architecture that allows researchers to specify their extraction objectives declaratively. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. The pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, and a threefold reduction in end-to-end processing time once manual correction is accounted for. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
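The headline metric is a relative reduction in word error rate (WER). As a minimal sketch of how such a figure is conventionally computed, and not the evaluation code used for VERITAS, WER can be taken as the word-level edit distance between a reference and a hypothesis divided by the number of reference words, and the relative reduction as the drop in WER divided by the baseline WER. The function names and the toy sentences below are illustrative assumptions, not data from the evaluation.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance normalised by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline_wer, system_wer):
    """Fraction of the baseline error removed by the system."""
    if baseline_wer == 0:
        raise ValueError("baseline WER must be non-zero")
    return (baseline_wer - system_wer) / baseline_wer


# Toy strings invented for illustration only (not drawn from the corpus).
reference = "the duke entered the city at dawn today"
baseline_output = "the duk entred the cty at dwan today"   # 4 wrong words
system_output = "the duke entered the cty at dawn today"   # 1 wrong word

baseline_wer = wer(reference, baseline_output)
system_wer = wer(reference, system_output)
reduction = relative_reduction(baseline_wer, system_wer)
print(f"baseline={baseline_wer:.3f} system={system_wer:.3f} reduction={reduction:.1%}")
```

Because the reduction is relative, it says how much of the baseline's error was removed, not how many percentage points the WER dropped: in the toy case the WER falls from 0.5 to 0.125, which is a 75% relative reduction.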