An image may be worth more than a thousand words, but only a few of those words carry the information that matters for a given task, and those are the ones to focus on. Ideal text-to-image (T2I) retrievers should therefore prioritize the specific visual attributes relevant to each query. To evaluate how well current retrievers handle attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted for their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, even recent, stronger retrievers based on Multimodal Large Language Models (MLLMs), which have larger output dimensions, struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for such queries. As a solution, we propose using promptable image embeddings, which these multimodal retrievers make possible and which boost performance by highlighting the required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.
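To make the Recall@5 metric and the linear-approximation strategy concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes that retrieval ranks images by cosine similarity, that Recall@K counts a query as a hit when at least one ground-truth image appears in the top K, and that the linear approximation is a least-squares map from general image embeddings to promptable ones. The function names `recall_at_k` and `fit_linear_map` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def recall_at_k(query_emb, image_emb, positives, k=5):
    """Fraction of queries with at least one ground-truth image in the top k.

    query_emb: (num_queries, dim) text embeddings.
    image_emb: (num_images, dim) image embeddings.
    positives: one list of ground-truth image indices per query.
    Assumption: this hit-based definition of Recall@K.
    """
    sims = l2_normalize(query_emb) @ l2_normalize(image_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & set(pos)) > 0
            for row, pos in zip(topk, positives)]
    return float(np.mean(hits))


def fit_linear_map(general_emb, promptable_emb):
    """Least-squares W such that general_emb @ W approximates promptable_emb.

    Assumption: one such map is fit per prompt on a small sample of images,
    then applied to the general embeddings of the whole image pool.
    """
    W, *_ = np.linalg.lstsq(general_emb, promptable_emb, rcond=None)
    return W
```

In this sketch, the promptable embeddings are computed with the full retriever only for the sample used to fit `W`. The rest of the pool is approximated with `general_emb @ W`, which avoids re-encoding every image when a prompt first becomes available at inference time.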