Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system that classifies scanned page images by visual content type (text, tables, and graphics), enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages, with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
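The inter-model agreement figures quoted above can be read as the fraction of unlabeled pages on which two models assign the same label. The following is a minimal sketch of that calculation, not the authors' implementation: the function name `pairwise_agreement`, the model keys, and the label strings are all hypothetical placeholders, and the toy predictions are invented for illustration only.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Return, for each pair of models, the fraction of pages labeled identically.

    `predictions` maps a model name to its list of predicted labels,
    one label per page, with pages in the same order for every model.
    """
    names = sorted(predictions)
    lengths = {len(predictions[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all models must label the same set of pages")
    n_pages = lengths.pop()
    if n_pages == 0:
        raise ValueError("at least one page is required")

    result = {}
    for a, b in combinations(names, 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        result[(a, b)] = same / n_pages
    return result


# Hypothetical toy predictions (not data from the paper).
preds = {
    "regnety": ["TEXT", "TABLE", "PHOTO", "TEXT", "DRAW"],
    "vit": ["TEXT", "TABLE", "PHOTO", "TEXT", "TEXT"],
    "clip": ["TEXT", "PHOTO", "PHOTO", "TABLE", "TEXT"],
}
agreement = pairwise_agreement(preds)
for pair, score in agreement.items():
    print(pair, f"{score:.0%}")
```

Because no ground truth exists for unlabeled pages, agreement of this kind measures consistency between models rather than correctness, which is why it complements rather than replaces test-set accuracy.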
