Text-to-image retrieval plays a crucial role across various applications, including digital libraries, e-commerce platforms, and multimedia databases, by enabling the search for images using text queries. Despite the advancements in Multimodal Large Language Models (MLLMs), which offer leading-edge performance, their applicability in large-scale, varied, and ambiguous retrieval scenarios is constrained by significant computational demands and the generation of injective embeddings. This paper introduces the Text2Pic Swift framework, tailored for efficient and robust retrieval of images corresponding to extensive textual descriptions in sizable datasets. The framework employs a two-tier approach: the initial Entity-based Ranking (ER) stage addresses the ambiguity inherent in lengthy text queries through a multiple-queries-to-multiple-targets strategy, effectively narrowing down potential candidates for subsequent analysis. Following this, the Summary-based Re-ranking (SR) stage further refines these selections based on concise query summaries. Additionally, we present a novel Decoupling-BEiT-3 encoder, specifically designed to tackle the challenges of ambiguous queries and to facilitate both stages of the retrieval process, thereby significantly improving computational efficiency via vector-based similarity assessments. Our evaluation, conducted on the AToMiC dataset, demonstrates that Text2Pic Swift outperforms current MLLMs by achieving up to an 11.06% increase in Recall@1000, alongside reductions in training and retrieval durations by 68.75% and 99.79%, respectively.
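The two-tier pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the function names, the max-over-entities aggregation, and the toy 2-D embeddings are all assumptions; in the actual system the Decoupling-BEiT-3 encoder would produce the entity, summary, and image vectors, and the similarity search would run over millions of candidates with an approximate-nearest-neighbor index rather than a Python loop.

```python
# Hypothetical sketch of a two-stage retrieval pipeline in the spirit of
# Text2Pic Swift: Entity-based Ranking (ER) narrows candidates via a
# multiple-queries-to-multiple-targets match, then Summary-based
# Re-ranking (SR) reorders them against one concise summary embedding.
# All names and vectors are illustrative assumptions, not the paper's API.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def entity_based_ranking(entity_vecs, image_vecs, k):
    """Stage 1 (ER): score each image by its best match over the entity
    sub-queries extracted from the long text query, keep the top-k."""
    scored = []
    for idx, img in enumerate(image_vecs):
        best = max(cosine(e, img) for e in entity_vecs)
        scored.append((best, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

def summary_based_reranking(summary_vec, image_vecs, candidates):
    """Stage 2 (SR): re-rank the surviving candidates against a single
    embedding of the query summary."""
    return sorted(candidates,
                  key=lambda i: cosine(summary_vec, image_vecs[i]),
                  reverse=True)

# Toy 2-D embeddings (purely illustrative values).
entities = [[1.0, 0.0], [0.0, 1.0]]                      # entity sub-queries
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
candidates = entity_based_ranking(entities, images, k=3)  # drops image 3
final = summary_based_reranking([1.0, 0.0], images, candidates)
```

Because both stages reduce to vector similarity, the expensive cross-attention scoring typical of MLLM-based retrieval is avoided, which is where the reported reductions in retrieval time would come from.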