This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike the common practice of inverting in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and made more robust through retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries that combines the mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom
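The pipeline summarized above can be sketched with a minimal, self-contained example. This is an illustrative assumption-laden sketch, not the paper's implementation: the embeddings are random stand-ins for a CLIP-like joint space, and `vocab_words`, `top_k`, and the embedding-averaging used to compose each word with the domain text are hypothetical choices.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-in embeddings for a shared image-text space.
rng = np.random.default_rng(0)
vocab_words = ["dog", "cat", "car", "origami", "sketch", "statue"]
vocab_emb = normalize(rng.normal(size=(len(vocab_words), 8)))  # word embeddings
img_emb = normalize(rng.normal(size=8))                        # query image embedding
domain_emb = normalize(rng.normal(size=8))                     # embedding of the domain text

# Step 1: discrete textual inversion -- nearest-neighbor search of the image
# embedding over the word vocabulary (cosine similarity), keeping several
# words so the image is mapped *softly* across the vocabulary.
sims = vocab_emb @ img_emb
top_k = 3
idx = np.argsort(-sims)[:top_k]
weights = np.exp(sims[idx]) / np.exp(sims[idx]).sum()  # soft weights over top words

# Step 2: weighted ensemble of text queries -- each mapped word is combined
# with the domain text; averaging the two embeddings here is a stand-in for
# encoding a composed prompt such as "a <domain> of a <word>".
word_queries = normalize(vocab_emb[idx] + domain_emb)
query = normalize((weights[:, None] * word_queries).sum(axis=0))

# The final query vector would then be matched against database image embeddings.
print([vocab_words[i] for i in idx])
print(query.shape)
```

In a real system the random arrays would be replaced by encoder outputs of a pretrained vision-language model, which is what lets the method run without any additional training.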