In recent years, deep neural networks have attracted considerable interest for document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problems of low inter-class discrimination and high intra-class structural variation across categories. In parallel, jointly learning text-level understanding with the corresponding visual properties of a document image has considerably improved classification accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network. It allows the network to learn the discriminant features of the image and text modalities simultaneously throughout the training stage. In addition, we encourage mutual learning by transferring positive knowledge between the image and text modalities during training. This constraint is realized by adding a truncated Kullback-Leibler divergence loss (Tr-KLD-Reg) as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to combine a mutual learning approach with a self-attention-based fusion module for document image classification. The experimental results demonstrate the effectiveness of our approach in terms of accuracy in both the single-modal and multi-modal settings. The proposed ensemble self-attention-based mutual learning model outperforms state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
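The mutual learning constraint described above can be illustrated with a minimal sketch. The abstract does not define Tr-KLD-Reg precisely, so the truncation rule used here is an assumption: the KL term is applied only when the peer modality predicts the correct class, which is one possible reading of "transferring positive knowledge". The function names (`softmax`, `cross_entropy`, `kl_divergence`, `mutual_loss`) and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Standard supervised loss for a single sample."""
    return -math.log(probs[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_loss(own_logits, peer_logits, label, weight=1.0):
    """Cross-entropy plus a truncated KL regularizer toward the peer branch.

    ASSUMPTION: "truncated" is interpreted as dropping the KL term whenever
    the peer's top prediction is wrong, so only correct (positive) knowledge
    is transferred. The paper's actual definition may differ.
    """
    own = softmax(own_logits)
    peer = softmax(peer_logits)
    loss = cross_entropy(own, label)
    peer_pred = max(range(len(peer)), key=peer.__getitem__)
    if peer_pred == label:
        loss += weight * kl_divergence(peer, own)
    return loss


# Hypothetical logits for one document over three classes; true class is 0.
image_logits = [2.0, 0.5, -1.0]   # image branch predicts class 0 (correct)
text_logits = [0.1, 1.5, 0.3]     # text branch predicts class 1 (wrong)

image_loss = mutual_loss(image_logits, text_logits, label=0)
text_loss = mutual_loss(text_logits, image_logits, label=0)
```

In this example the image branch receives only its supervised loss, because the text branch is wrong and its knowledge is truncated, while the text branch is additionally pulled toward the image branch's correct prediction.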