Zero-Shot Learning (ZSL), which aims to recognize unseen objects automatically, is a promising learning paradigm that lets machines continuously absorb new real-world knowledge. Recently, the Knowledge Graph (KG) has proven to be an effective scheme for handling zero-shot tasks with large-scale, non-attribute data. Prior studies typically embed the relationships between seen and unseen objects, drawn from existing knowledge graphs, into visual information to improve recognition of unseen data. However, real-world knowledge is naturally formed by multimodal facts. Compared with ordinary structural knowledge from a graph perspective, a multimodal KG can provide cognitive systems with fine-grained knowledge. For example, text descriptions and visual content can depict more critical details of a fact than knowledge triplets alone. Unfortunately, this multimodal fine-grained knowledge remains largely unexploited because of the bottleneck of feature alignment between different modalities. To that end, we propose a multimodal intensive ZSL framework that matches image regions with their corresponding semantic embeddings via a designed dense attention module and a self-calibration loss. This allows the semantic transfer process of our ZSL framework to learn more differentiated knowledge between entities. Our model also overcomes the performance limitation of using only coarse global features. We conduct extensive experiments and evaluate our model on large-scale real-world data. The experimental results demonstrate the effectiveness of the proposed model on standard zero-shot classification tasks.
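As a rough illustration of what matching image regions with semantic embeddings can look like, the sketch below scores each class by letting its semantic embedding attend over a set of region features. It is a generic region-to-semantic attention in plain Python, not the dense attention module of the proposed framework, and it leaves out the self-calibration loss entirely. The function names, the feature dimension, the number of regions and classes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_semantic_scores(regions, semantics):
    """Score every class by attending over image regions.

    regions:   list of region feature vectors (hypothetical visual features).
    semantics: list of class embeddings in the same space (an assumption;
               a real model would learn a projection between the two).
    Returns (scores, attention), where attention[c][r] is the weight that
    class c places on region r.
    """
    scores, attention = [], []
    for sem in semantics:
        # Compare the class embedding with every region, then normalize.
        weights = softmax([dot(sem, reg) for reg in regions])
        # Class-specific visual feature: attention-weighted sum of regions.
        attended = [
            sum(w * reg[k] for w, reg in zip(weights, regions))
            for k in range(len(sem))
        ]
        # Compatibility between the attended feature and the class embedding.
        scores.append(dot(attended, sem))
        attention.append(weights)
    return scores, attention


random.seed(0)
dim = 8  # hypothetical feature dimension
regions = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(9)]
semantics = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

scores, attention = region_semantic_scores(regions, semantics)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

Because the scores depend on which regions each class attends to, two classes with similar global appearance can still be separated by local details, which is the intuition behind moving beyond a single global feature.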