Real-world document images suffer from many kinds of degradation that make them harder to recognize and analyze, so binarization is a fundamental and crucial step toward optimal performance in any document analysis task. We propose DocBinFormer (Document Binarization Transformer), a novel two-level vision transformer (TL-ViT) architecture for effective document image binarization. The architecture employs a two-level transformer encoder to capture both global and local feature representations from the input images. These complementary bi-level features are exploited for document image binarization, improving results on both system-generated and handwritten document images. The architecture contains no convolutional layers: the transformer encoder operates directly on pixel patches and sub-patches together with their positional information, while the decoder generates a clean (binarized) output image from the encoded latent representation of the patches. Instead of using a single vision transformer block to extract information from the image patches, the proposed architecture uses two transformer blocks to cover more of the extracted feature space at both global and local scales. Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the proposed model outperforms state-of-the-art techniques on four metrics. The source code will be made available at https://github.com/RisabBiswas/DocBinFormer.
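To make the patch and sub-patch idea concrete, the following is a minimal sketch of how an image could be split into two levels of tokens before being passed to a two-level encoder. It is not the authors' implementation: the function names (`split_into_patches`, `two_level_tokens`), the patch size of 16, the sub-patch size of 4, and the 64×64 toy image are all assumptions chosen for illustration. It uses plain Python lists so that it runs without dependencies.

```python
def split_into_patches(image, size):
    """Split a 2-D list into non-overlapping size x size tiles, in row-major order."""
    height, width = len(image), len(image[0])
    if height % size or width % size:
        raise ValueError("image sides must be divisible by the patch size")
    return [
        [row[left:left + size] for row in image[top:top + size]]
        for top in range(0, height, size)
        for left in range(0, width, size)
    ]


def flatten(tile):
    """Flatten a 2-D tile into a 1-D list of pixel values."""
    return [pixel for row in tile for pixel in row]


def two_level_tokens(image, patch=16, sub_patch=4):
    """Return (global_tokens, local_tokens) for a hypothetical two-level encoder.

    global_tokens[i] is patch i flattened; local_tokens[i][j] is sub-patch j
    of patch i flattened. The list indices serve as the positional information.
    The sizes 16 and 4 are illustrative assumptions, not values from the paper.
    """
    patches = split_into_patches(image, patch)
    global_tokens = [flatten(p) for p in patches]
    local_tokens = [
        [flatten(s) for s in split_into_patches(p, sub_patch)]
        for p in patches
    ]
    return global_tokens, local_tokens


# Toy 64x64 "image" whose pixel value encodes its own position.
image = [[r * 64 + c for c in range(64)] for r in range(64)]
global_tokens, local_tokens = two_level_tokens(image)
print(len(global_tokens), len(global_tokens[0]))  # 16 256
print(len(local_tokens), len(local_tokens[0]), len(local_tokens[0][0]))  # 16 16 16
```

In this sketch, each of the 16 patches yields one global token of 256 pixels and 16 local tokens of 16 pixels each. An actual model would project these tokens linearly and add learned positional embeddings before the transformer blocks; those steps are omitted here.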