WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
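The contrastive paradigm mentioned above can be illustrated with a minimal sketch. This is not the WikiCLIP implementation: the function names, the toy 2-D embeddings, and the temperature value are assumptions made for illustration, and the VGKA and the Hard Negative Synthesis Mechanism are not modeled. The sketch only shows the general idea that an image embedding is scored against entity embeddings by cosine similarity, that an InfoNCE-style loss pulls the matching entity closer, and that a visually similar negative makes the loss larger (and hence the training signal stronger) than an easy negative does.

```python
import math

def l2_normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(image_emb, entity_embs):
    # Cosine similarity between one image embedding and each entity embedding.
    q = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(q, l2_normalize(e))) for e in entity_embs]

def retrieve(image_emb, entity_embs):
    # Inference in a contrastive setup: return the index of the best-scoring entity.
    scores = cosine_scores(image_emb, entity_embs)
    return max(range(len(scores)), key=scores.__getitem__)

def info_nce_loss(image_emb, entity_embs, positive_idx, temperature=0.07):
    # InfoNCE-style loss: negative log-softmax of the positive entity's score.
    # The temperature of 0.07 is an illustrative assumption.
    logits = [s / temperature for s in cosine_scores(image_emb, entity_embs)]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_idx]

# Toy data (hypothetical): index 0 is the correct entity.
image = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # clearly different from the image
hard_negative = [0.8, 0.6]   # close to the image, but a different entity

best = retrieve(image, [positive, easy_negative, hard_negative])
loss_easy = info_nce_loss(image, [positive, easy_negative], positive_idx=0)
loss_hard = info_nce_loss(image, [positive, easy_negative, hard_negative], positive_idx=0)
print(best, loss_easy < loss_hard)
```

In a setup like this, entity embeddings do not depend on the query image and could be computed once and cached, so inference reduces to a similarity search. That is a general property of contrastive retrieval rather than a description of how WikiCLIP obtains its reported latency.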

