While generative modeling has become ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored. In this paper, we recast image retrieval as a form of generative modeling by employing a sequence-to-sequence model, contributing to the current trend toward unified models. Our framework, IRGen, is a unified model that enables end-to-end differentiable search and thus achieves superior performance through direct optimization. In developing IRGen, we tackle the key technical challenge of converting an image into a very short sequence of semantic units to enable efficient and effective retrieval. Experiments demonstrate that our model yields significant improvements on three commonly used benchmarks; for example, on the In-shop dataset it achieves 22.9% higher precision@10 than the best baseline method, with a comparable recall@10 score.
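For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of precision@K and recall@K under their common textbook definitions. The function names and the toy data are hypothetical, and the exact evaluation protocol used for the reported numbers may differ (for example, some retrieval benchmarks count recall@K as 1 if any relevant item appears in the top K).

```python
from typing import Hashable, Sequence, Set


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # no relevant items exist, so nothing can be recalled
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


# Toy example: a ranked list of retrieved image IDs and the ground-truth matches.
ranked_ids = ["a", "b", "c", "d", "e"]
relevant_ids = {"a", "c", "z"}
p_at_5 = precision_at_k(ranked_ids, relevant_ids, k=5)  # 2 of the 5 retrieved are relevant
r_at_5 = recall_at_k(ranked_ids, relevant_ids, k=5)     # 2 of the 3 relevant are retrieved
```

Under these definitions, a method can raise precision@10 substantially while leaving recall@10 roughly unchanged, which matches the shape of the comparison stated in the abstract.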