Text-to-image retrieval plays a crucial role across various applications, including digital libraries, e-commerce platforms, and multimedia databases, by enabling the search for images using text queries. Despite the advancements in Multimodal Large Language Models (MLLMs), which offer leading-edge performance, their applicability in large-scale, varied, and ambiguous retrieval scenarios is constrained by significant computational demands and the generation of injective embeddings. This paper introduces the Text2Pic Swift framework, tailored for efficient and robust retrieval of images corresponding to extensive textual descriptions in sizable datasets. The framework employs a two-tier approach: the initial Entity-based Ranking (ER) stage addresses the ambiguity inherent in lengthy text queries through a multiple-queries-to-multiple-targets strategy, effectively narrowing down potential candidates for subsequent analysis. Following this, the Summary-based Re-ranking (SR) stage further refines these selections based on concise query summaries. Additionally, we present a novel Decoupling-BEiT-3 encoder, specifically designed to tackle the challenges of ambiguous queries and to facilitate both stages of the retrieval process, thereby significantly improving computational efficiency via vector-based similarity assessments. Our evaluation, conducted on the AToMiC dataset, demonstrates that Text2Pic Swift outperforms current MLLMs by achieving up to an 11.06% increase in Recall@1000, alongside reductions in training and retrieval durations by 68.75% and 99.79%, respectively.
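The two-tier pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the function names, the max-over-entities aggregation, and the toy 2-D embeddings are all assumptions; in the actual system the Decoupling-BEiT-3 encoder would produce the entity, summary, and image vectors, and the similarity search would run over millions of candidates with an approximate-nearest-neighbor index rather than a Python loop.

```python
# Hypothetical sketch of a two-stage retrieval pipeline in the spirit of
# Text2Pic Swift: Entity-based Ranking (ER) narrows candidates via a
# multiple-queries-to-multiple-targets match, then Summary-based
# Re-ranking (SR) reorders them against one concise summary embedding.
# All names and vectors are illustrative assumptions, not the paper's API.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def entity_based_ranking(entity_vecs, image_vecs, k):
    """Stage 1 (ER): score each image by its best match over the entity
    sub-queries extracted from the long text query, keep the top-k."""
    scored = []
    for idx, img in enumerate(image_vecs):
        best = max(cosine(e, img) for e in entity_vecs)
        scored.append((best, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

def summary_based_reranking(summary_vec, image_vecs, candidates):
    """Stage 2 (SR): re-rank the surviving candidates against a single
    embedding of the query summary."""
    return sorted(candidates,
                  key=lambda i: cosine(summary_vec, image_vecs[i]),
                  reverse=True)

# Toy 2-D embeddings (purely illustrative values).
entities = [[1.0, 0.0], [0.0, 1.0]]                      # entity sub-queries
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
candidates = entity_based_ranking(entities, images, k=3)  # drops image 3
final = summary_based_reranking([1.0, 0.0], images, candidates)
```

Because both stages reduce to vector similarity, the expensive cross-attention scoring typical of MLLM-based retrieval is avoided, which is where the reported reductions in retrieval time would come from.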