DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization

Document images are subject to many kinds of degradation that make them harder to recognize and analyze, so binarization is a fundamental and crucial step for achieving optimal performance in any document analysis task. We propose DocBinFormer (Document Binarization Transformer), a novel two-level vision transformer (TL-ViT) architecture for effective document image binarization. The architecture employs a two-level transformer encoder to capture both global and local feature representations from the input images. These complementary bi-level features are exploited for document image binarization, yielding improved results on both system-generated and handwritten document images. The architecture contains no convolutional layers: the transformer encoder operates directly on pixel patches and sub-patches together with their positional information, while the decoder generates a clean (binarized) output image from the encoded latent representation of the patches. Instead of using a single vision transformer block to extract information from the image patches, the proposed architecture uses two transformer blocks for greater coverage of the extracted feature space at the global and local scales. Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the proposed model outperforms state-of-the-art techniques on four metrics. The source code will be made available at https://github.com/RisabBiswas/DocBinFormer.
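To make the patch and sub-patch idea concrete, the following is a minimal sketch of how an image could be tiled at two levels before being passed to a two-level encoder. It is not the authors' implementation: the function names `split_into_patches` and `two_level_tokens`, the patch sizes, and the toy 8x8 input are all assumptions chosen for illustration, and the sketch omits the linear projection, positional embeddings, and attention layers entirely.

```python
from typing import List, Tuple

Grid = List[List[int]]


def split_into_patches(image: Grid, patch: int) -> List[Grid]:
    """Split a 2D grid into non-overlapping patch x patch tiles in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def flatten(tile: Grid) -> List[int]:
    """Flatten a 2D tile into a 1D list of pixel values."""
    return [value for row in tile for value in row]


def two_level_tokens(
    image: Grid, patch: int, sub_patch: int
) -> Tuple[List[List[int]], List[List[List[int]]]]:
    """Return (outer, inner) pixel sequences.

    outer[i]    : flattened pixels of patch i (global level).
    inner[i][j] : flattened pixels of sub-patch j inside patch i (local level).
    Both patch sizes are hypothetical; the source does not specify them.
    """
    patches = split_into_patches(image, patch)
    outer = [flatten(p) for p in patches]
    inner = [[flatten(s) for s in split_into_patches(p, sub_patch)] for p in patches]
    return outer, inner


# Toy 8x8 "image" whose pixel value encodes its position (row * 8 + column).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
outer, inner = two_level_tokens(image, patch=4, sub_patch=2)
print(len(outer), len(outer[0]))        # number of patches, pixels per patch
print(len(inner[0]), len(inner[0][0]))  # sub-patches per patch, pixels per sub-patch
```

In a full model, each flattened patch and sub-patch would presumably be projected to an embedding and combined with positional information before entering its respective transformer block; the sketch only shows the tiling that exposes both scales.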