Open-domain visual entity recognition aims to identify entities depicted in images and link them to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and the label distribution is long-tailed. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. In this work, we propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines images and text descriptions in a shared semantic space grounded in structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that combining visual, textual, and structured knowledge substantially improves accuracy, especially for rare and unseen entities. Our smallest model improves accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
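To make the contrastive alignment described above concrete, the snippet below is a minimal PyTorch sketch of a symmetric InfoNCE objective between image embeddings and embeddings of knowledge-grounded entity text (e.g. a Wikidata label plus description and type context). The function name, the dual-encoder setup, and the temperature value are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def knowledge_guided_contrastive_loss(image_emb, entity_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb:  (B, D) embeddings of query images.
    entity_emb: (B, D) embeddings of the matching entity descriptions
                (assumed to encode Wikidata label, description, and type context).
    """
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    entity_emb = F.normalize(entity_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ entity_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image -> entity and entity -> image.
    loss_i2e = F.cross_entropy(logits, targets)
    loss_e2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2e + loss_e2i)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    B, D = 8, 256
    print(knowledge_guided_contrastive_loss(torch.randn(B, D), torch.randn(B, D)))
```

At inference time, zero-shot recognition under such a setup would amount to nearest-neighbor search: embed the query image and rank candidate entities by cosine similarity of their knowledge-grounded text embeddings.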