Despite steady gains in accuracy, current image-text retrieval methods have a querying time complexity that grows with $N$, the number of gallery samples, which hinders their application in practice. To improve efficiency, this paper presents a simple and effective keyword-guided pre-screening framework for image-text retrieval. Specifically, we convert image and text data into keywords and perform keyword matching across modalities to exclude a large number of irrelevant gallery samples before they reach the retrieval network. For keyword prediction, we cast the task as a multi-label classification problem and propose a multi-task learning scheme that appends multi-label classifiers to the image-text retrieval network, yielding lightweight and high-performance keyword prediction. For keyword matching, we adopt the inverted index used in search engines, which benefits both the time and space complexity of pre-screening. Extensive experiments on two widely used datasets, Flickr30K and MS-COCO, verify the effectiveness of the proposed framework. Equipped with only two embedding layers, the framework achieves $O(1)$ querying time complexity and, when applied before common image-text retrieval methods, improves retrieval efficiency while maintaining their performance. Our code will be released.
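
To make the keyword-matching step concrete, the following is a minimal sketch of pre-screening with an inverted index. It is not the released implementation: the names `build_inverted_index` and `prescreen`, the toy gallery, and the rule that a gallery sample is kept if it shares at least one keyword with the query are all assumptions made for illustration, and the keywords are assumed to have been predicted already.

```python
from collections import defaultdict


def build_inverted_index(gallery_keywords):
    """Map each keyword to the set of gallery ids tagged with it (hypothetical helper)."""
    index = defaultdict(set)
    for sample_id, keywords in gallery_keywords.items():
        for kw in keywords:
            index[kw].add(sample_id)
    return index


def prescreen(index, query_keywords):
    """Return gallery ids that share at least one keyword with the query.

    Assumed matching rule: union of the posting sets of the query keywords.
    """
    candidates = set()
    for kw in query_keywords:
        # .get avoids creating empty entries for unseen keywords
        candidates |= index.get(kw, set())
    return candidates


# Toy gallery: keywords are assumed to come from the multi-label classifiers.
gallery = {
    "img_0": {"dog", "grass"},
    "img_1": {"cat", "sofa"},
    "img_2": {"dog", "beach"},
    "img_3": {"car", "street"},
}
index = build_inverted_index(gallery)

# Only the surviving candidates would be passed on to the retrieval network.
candidates = prescreen(index, {"dog", "frisbee"})
print(sorted(candidates))  # ['img_0', 'img_2']
```

In this sketch each keyword lookup is a hash-table access, so its average cost does not depend on how many samples the gallery holds; the index is built once offline and reused for every query.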