Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set while reducing inference latency by a factor of nearly 100 compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
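To make the contrastive paradigm concrete, the following is a minimal sketch of retrieval by cosine similarity and of an InfoNCE-style loss in which hard negatives are appended to the candidate set. It is not the WikiCLIP implementation: the abstract does not specify the loss, the VGKA is not modeled, and the function names (`rank_entities`, `info_nce`), the temperature value, and the toy three-dimensional embeddings are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def rank_entities(image_emb, entity_embs):
    """Return entity names sorted by descending similarity to the image.

    entity_embs maps a (hypothetical) entity name to its embedding; in a
    contrastive setup these can be precomputed once for the knowledge base.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in entity_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def info_nce(image_emb, positive, negatives, temperature=0.07):
    """InfoNCE loss for one image: negative log-softmax of the positive logit.

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    logits = [cosine(image_emb, positive) / temperature]
    logits += [cosine(image_emb, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (assumed values) standing in for image and entity features.
entities = {
    "golden_retriever": [0.9, 0.1, 0.0],
    "labrador": [0.8, 0.3, 0.1],
    "eiffel_tower": [0.0, 0.1, 0.9],
}
image = [1.0, 0.1, 0.0]

ranking = rank_entities(image, entities)
easy_loss = info_nce(image, entities["golden_retriever"], [entities["eiffel_tower"]])
hard_loss = info_nce(
    image,
    entities["golden_retriever"],
    [entities["eiffel_tower"], entities["labrador"]],
)
print(ranking[0], easy_loss < hard_loss)
```

In this sketch, adding a negative that is close to the image (here `labrador`) raises the loss, which is the general reason hard negatives push a contrastive model toward finer-grained discrimination. Because entity embeddings are independent of the query image at retrieval time, ranking reduces to a similarity search, which is one plausible source of the latency advantage over generative decoding.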