Although Composed Image Retrieval (CIR) has progressed, we find empirically that a portion of failed retrieval results are inconsistent with their relative captions. To address this issue, this work adopts a Visual Question Answering (VQA) perspective to improve CIR performance. The resulting method, VQA4CIR, is a post-processing approach that can be plugged directly into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to reduce the adverse effect of retrieved images that are inconsistent with the relative caption. To identify such images, we use a "QA generation to VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several question-answer pairs from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. Given a retrieved image and a question, the VQA model produces an answer; when that answer differs from the answer in the QA pair, the image is flagged as inconsistent with the relative caption. CIR performance can then be improved by adjusting the ranks of the inconsistent retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
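The self-verification and re-ranking step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `is_consistent` and `rerank_top_c`, the exact-match answer comparison, and the choice to move inconsistent images to the end of the top-C list are all assumptions made for illustration, since the abstract only states that the ranks of inconsistent images are modified. The `vqa` argument stands in for the fine-tuned LVLM, and `qa_pairs` for the output of the fine-tuned LLM.

```python
from typing import Callable, List, Sequence, Tuple

# (question, expected answer) derived from the relative caption.
QAPair = Tuple[str, str]
# Hypothetical VQA interface: (image id, question) -> answer string.
VQAModel = Callable[[str, str], str]


def is_consistent(image: str, qa_pairs: Sequence[QAPair], vqa: VQAModel) -> bool:
    """Return True if the VQA answer matches the expected answer for every pair.

    Assumption: answers are compared by case-insensitive exact match.
    """
    return all(
        vqa(image, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )


def rerank_top_c(
    retrieved: Sequence[str],
    qa_pairs: Sequence[QAPair],
    vqa: VQAModel,
    c: int,
) -> List[str]:
    """Demote top-C images whose VQA answers contradict the QA pairs.

    Assumption: inconsistent images are moved behind the consistent ones
    within the top-C window, keeping the original order inside each group.
    Images ranked below C are never checked and keep their positions.
    """
    top, rest = list(retrieved[:c]), list(retrieved[c:])
    flags = [is_consistent(image, qa_pairs, vqa) for image in top]
    consistent = [image for image, ok in zip(top, flags) if ok]
    inconsistent = [image for image, ok in zip(top, flags) if not ok]
    return consistent + inconsistent + rest
```

Because the function only reorders an existing ranking, it can sit behind any CIR method that returns a ranked list, which matches the post-processing role described above.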