Document classification is a critical component of automated document processing systems. In recent years, multi-modal approaches have become increasingly popular for this task. Despite the improvements they offer, these approaches remain underutilized in industry because they require very large volumes of training data and extensive computational power. In this paper, we attempt to address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art results in document classification using small-scale datasets. To evaluate the efficacy of the visual features generated by our approach on limited data, we tested it on the standard Tobacco-3482 dataset. Our experiments show a substantial improvement in image-based classifiers, with a gain of 4.64% using ResNet50 with no document pre-training. The approach also sets a new record for the best accuracy on the Tobacco-3482 dataset, with a score of 91.14% using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its low resource requirements, and the resulting performance make it a good prospect for industrial use cases.