Composed image retrieval aims to find the image that best matches a multi-modal user query consisting of a reference image paired with text. Existing methods commonly pre-compute image embeddings over the entire corpus and, at test time, compare them to a reference image embedding that has been modified by the query text. Such a pipeline is very efficient at test time, since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding under the guidance of only a short textual description can be difficult, especially when done independently of the potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., to form reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, its computational cost is prohibitive for large-scale datasets, since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distance metric and quickly prunes the candidate set. Our second stage employs a dual-encoder architecture, which attends to the input reference-text-candidate triplet and re-ranks the remaining candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
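The two-stage scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stage1_prune`, `stage2_rerank`, `two_stage_retrieve`), the use of cosine similarity as the vector distance, and the `score_triplet` callback are all assumptions made for illustration. In particular, `score_triplet` is a hypothetical stand-in for the second-stage dual-encoder network, and `query_emb` stands in for the text-modified reference image embedding.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def stage1_prune(query_emb, corpus_embs, k):
    """Stage 1 (assumed form): keep the k corpus items with the highest
    cosine similarity to the query embedding, best first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    sims = l2_normalize(corpus_embs) @ l2_normalize(query_emb)
    k = min(k, sims.shape[0])
    # argpartition finds the top-k set without sorting the whole corpus.
    top = np.argpartition(-sims, k - 1)[:k]
    # Sort only the shortlist by descending similarity.
    return top[np.argsort(-sims[top], kind="stable")]


def stage2_rerank(reference, text, candidates, candidate_ids, score_triplet):
    """Stage 2 (assumed form): score each (reference, text, candidate)
    triplet with a caller-supplied function and sort by descending score.
    score_triplet is a hypothetical placeholder for the re-ranking model."""
    scores = np.array(
        [score_triplet(reference, text, candidates[i]) for i in candidate_ids]
    )
    order = np.argsort(-scores, kind="stable")
    return candidate_ids[order], scores[order]


def two_stage_retrieve(query_emb, corpus_embs, reference, text,
                       candidates, score_triplet, k=50):
    """Prune with cheap vector similarity, then re-rank the shortlist."""
    shortlist = stage1_prune(query_emb, corpus_embs, k)
    return stage2_rerank(reference, text, candidates, shortlist, score_triplet)


if __name__ == "__main__":
    # Toy data: five 2-D "embeddings" and a made-up scoring rule.
    corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                       [-1.0, 0.0], [0.7, 0.7]])
    query = np.array([1.0, 0.0])

    def toy_scorer(reference, text, candidate):
        # Illustrative only: prefers candidates whose second component
        # is close to 0.1.
        return -abs(candidate[1] - 0.1)

    ids, scores = two_stage_retrieve(query, corpus, None, "toy text",
                                     corpus, toy_scorer, k=3)
    print(ids.tolist())  # [1, 0, 4]
```

In this sketch the expensive scorer runs on only `k` candidates rather than on the whole corpus, which is the trade-off the abstract describes: corpus embeddings can still be pre-computed for the first stage, while query-candidate interaction is confined to the shortlist.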