Document denoising and binarization are fundamental problems in document processing, but existing datasets are often too small and too simple to effectively train and benchmark modern data-driven machine learning models. To fill this gap, we introduce ShabbyPages, a new document image dataset designed for training and benchmarking document denoisers and binarizers. ShabbyPages contains over 6,000 clean "born digital" images with synthetically noised counterparts ("shabby pages"), produced with the Augraphy document augmentation tool so that they appear to have been printed and faxed, photocopied, or otherwise altered through physical processes. In this paper, we describe how ShabbyPages was created and demonstrate its utility by training convolutional denoisers that remove real noise features with a high degree of human-perceptible fidelity, establishing baseline performance for a new ShabbyPages benchmark.