In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of this scale is to use dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which, given an input image, learns to auto-regressively decode a semantic and discriminative ``code'' identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching, and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.