Composed image retrieval aims to find the image that best matches a multi-modal user query consisting of a reference image and a text description. Existing methods commonly pre-compute image embeddings over the entire corpus and, at test time, compare them to a reference image embedding modified by the query text. Such a pipeline is very efficient at test time, since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description is difficult, especially when done independently of the potential candidates. An alternative is to let the query interact with every possible candidate, i.e., to score reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, its computational cost is prohibitive for large-scale datasets, because candidate embeddings can no longer be pre-computed. We propose to combine the merits of both schemes in a two-stage model. The first stage adopts the conventional vector distance metric and quickly prunes the candidate set. The second stage employs a dual-encoder architecture that attends to the input reference-text-candidate triplet and re-ranks the remaining candidates. Both stages use a vision-and-language pre-trained network, a class of models that has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
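The prune-then-re-rank control flow described above can be illustrated with a minimal sketch. This is not the proposed model. The function names (`cosine`, `prune`, `rerank`), the toy embeddings, and the `toy_scores` table are all hypothetical. Cosine similarity is assumed as the first-stage measure, since the abstract only says "vector distances", and a fixed lookup table stands in for the learned second-stage scorer over reference-text-candidate triplets.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune(query_emb: Vector, corpus_embs: List[Vector], k: int) -> List[int]:
    """Stage 1: return indices of the k candidates most similar to the query.

    The corpus embeddings are pre-computed, so only cheap vector
    comparisons happen at query time.
    """
    order = sorted(
        range(len(corpus_embs)),
        key=lambda i: cosine(query_emb, corpus_embs[i]),
        reverse=True,
    )
    return order[:k]


def rerank(shortlist: List[int], score_triplet: Callable[[int], float]) -> List[int]:
    """Stage 2: re-order the shortlist by a per-candidate triplet score.

    score_triplet is called once per shortlisted candidate, so the
    expensive model runs k times rather than once per corpus image.
    """
    return sorted(shortlist, key=score_triplet, reverse=True)


# Toy data (hypothetical): four pre-computed candidate embeddings.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
# Stand-in for the reference image embedding after modification by the text.
query = [1.0, 0.2]
# Stand-in for the second-stage scores of each reference-text-candidate triplet.
toy_scores = {0: 0.9, 1: 0.4, 2: 0.95, 3: 0.6}

shortlist = prune(query, corpus, k=3)
ranking = rerank(shortlist, lambda i: toy_scores[i])
print(shortlist, ranking)
```

In this toy setup, candidate 2 has the highest second-stage score but never reaches the re-ranker, because the first stage pruned it. This is the trade-off that the choice of `k` controls: a larger shortlist costs more second-stage evaluations but is less likely to discard the true target.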