While generative modeling has become ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored. In this paper, we recast image retrieval as a form of generative modeling by employing a sequence-to-sequence model, contributing to the current trend toward unified modeling. Our framework, IRGen, is a unified model that enables end-to-end differentiable search and thereby achieves superior performance through direct optimization. In developing IRGen, we tackle the key technical challenge of converting an image into a very short sequence of semantic units to enable efficient and effective retrieval. Experiments demonstrate that our model yields significant improvements on three commonly used benchmarks; for example, its precision@10 on the In-shop dataset is 22.9% higher than that of the best baseline method, with a comparable recall@10 score.