Recent work shows that encyclopedia documents serve as helpful auxiliary information for zero-shot learning. Existing methods transfer knowledge by aligning the entire semantics of a document with the corresponding images. However, they disregard that the semantic information conveyed by documents and images is not equivalent, resulting in suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module that generates multi-view semantic embeddings on both the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among these embeddings, we propose a local-to-semantic variance loss that captures distinct local details and a multiple-semantic diversity loss that enforces orthogonality among embeddings. Two further losses then partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, our method consistently outperforms state-of-the-art approaches with two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations.
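The diversity loss described above can be illustrated with a minimal sketch. This is not the paper's exact formulation (which is not given here); it is an assumed implementation in which orthogonality among the K view embeddings is encouraged by penalizing their pairwise cosine similarities — the off-diagonal entries of the normalized Gram matrix:

```python
import numpy as np

def semantic_diversity_loss(embeddings: np.ndarray) -> float:
    """Hypothetical sketch of a multiple-semantic diversity loss.

    Pushes the K view embeddings toward mutual orthogonality by
    penalizing the squared off-diagonal cosine similarities.

    embeddings: (K, D) array, one row per semantic view.
    """
    # Normalize each view embedding to unit length.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = e @ e.T                        # (K, K) cosine similarities
    off = gram - np.diag(np.diag(gram))   # keep only cross-view terms
    return float(np.mean(off ** 2))

# Mutually orthogonal views incur zero loss; redundant views are penalized.
orthogonal = np.eye(3)
redundant = np.array([[1.0, 0.0], [0.99, 0.01]])
print(semantic_diversity_loss(orthogonal))  # 0.0
print(semantic_diversity_loss(redundant) > 0.0)
```

Minimizing such a term drives each view embedding toward a distinct semantic concept, which is what makes the subsequent partial (matching-concept) alignment meaningful.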