Historical documents encompass a wealth of cultural treasures but suffer severe damage over time, including missing characters, paper deterioration, and ink erosion. However, existing document processing methods primarily focus on binarization, enhancement, and similar tasks, neglecting the repair of such damage. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset, HDR28K, and a diffusion-based network, DiffHDR, for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that DiffHDR trained on HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction in document processing and contribute to the preservation of invaluable cultures and civilizations. The dataset and code are available at https://github.com/yeungchenwa/HDR.
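The abstract does not specify the exact form of the character perceptual loss; a common formulation compares intermediate feature maps of the repaired and ground-truth character regions extracted by a pretrained recognizer, weighted per layer. The sketch below is illustrative only (the function name, the use of NumPy arrays as stand-ins for extracted feature maps, and the layer-weighting scheme are assumptions, not details from the paper):

```python
import numpy as np

def char_perceptual_loss(repaired_feats, target_feats, layer_weights=None):
    """Weighted mean-squared distance between per-layer feature maps of
    repaired and ground-truth character regions (a standard perceptual-loss
    form; NOT the paper's exact loss).

    repaired_feats / target_feats: lists of np.ndarray feature maps, one per
    layer of a (hypothetical) pretrained character recognizer.
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(repaired_feats)
    total = 0.0
    for w, f_hat, f in zip(layer_weights, repaired_feats, target_feats):
        # Mean squared error over all feature-map elements at this layer.
        total += w * float(np.mean((f_hat - f) ** 2))
    return total / sum(layer_weights)
```

In practice the feature maps would come from a frozen recognizer applied to annotated character regions, so the loss penalizes semantic mismatches (wrong glyph structure) that a plain pixel loss would under-weight.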