Zero-shot learning has been studied extensively in visual recognition and has attracted significant interest in recent years. However, work on zero-shot learning for document image classification remains scarce. Existing studies either focus exclusively on zero-shot inference or use evaluation protocols that do not align with the established zero-shot evaluation criteria of the visual recognition domain. To address this gap, we provide a comprehensive analysis of document image classification in the Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings, with a methodology and evaluation that follow the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and its GZSL harmonic mean by 24% on the RVL-CDIP dataset. The module is lightweight, adding only 3.3% more parameters to CLIP. Our work sets a direction for future research in zero-shot document classification.
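For readers unfamiliar with the GZSL harmonic mean reported above, the following is a minimal sketch of how the metric is conventionally computed in the visual recognition literature: the harmonic mean of the per-class-averaged top-1 accuracy on seen classes and on unseen classes. The function names and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact evaluation code may differ.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred, classes):
    """Average of per-class top-1 accuracies, restricted to `classes`.

    Hypothetical helper: classes with no test samples are skipped.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth in classes:
            total[truth] += 1
            correct[truth] += int(truth == pred)
    accs = [correct[c] / total[c] for c in classes if total[c] > 0]
    return sum(accs) / len(accs) if accs else 0.0


def gzsl_harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# Toy example with made-up labels: "letter" is treated as a seen class,
# "memo" as an unseen class.
y_true = ["letter", "letter", "memo", "memo"]
y_pred = ["letter", "memo", "memo", "memo"]
acc_s = per_class_top1(y_true, y_pred, {"letter"})
acc_u = per_class_top1(y_true, y_pred, {"memo"})
h = gzsl_harmonic_mean(acc_s, acc_u)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model that scores well on seen classes but poorly on unseen ones still receives a low value, which is why the metric is preferred over a plain average in the generalized setting.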