Partial scene text retrieval aims to localize and search, within an image gallery, for text instances that are identical or similar to a given query text. Existing methods, however, handle only text-line instances, leaving the problem of searching for partial patches within those text lines unsolved because the training data lack patch-level annotations. To address this issue, we propose a network that simultaneously retrieves both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, it adopts Multiple Instance Learning (MIL) to learn their similarities to the query text without requiring extra annotations. However, constructing bags, a standard step in conventional MIL approaches, introduces numerous noisy training samples and slows inference. To address this, we propose a Ranking MIL (RankMIL) approach that adaptively filters these noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that directly searches for the target partial patch within a text-line instance during inference, without requiring bags, which greatly improves both the efficiency and the performance of partial-patch retrieval. The source code and dataset are available at https://github.com/lanfeng4659/PSTR.
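The shared-space retrieval and MIL-style bag scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes precomputed 1-D embeddings and cosine similarity, and the function names (`rank_gallery`, `mil_bag_score`) are hypothetical.

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_gallery(query_emb, instance_embs):
    """Rank gallery text-line embeddings by cross-modal similarity
    to the query-text embedding (indices, most similar first)."""
    sims = [cosine_sim(query_emb, e) for e in instance_embs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])

def mil_bag_score(query_emb, patch_embs):
    """MIL-style bag score: a text line (bag) matches the query
    as well as its best-matching partial patch (instance) does."""
    return max(cosine_sim(query_emb, p) for p in patch_embs)
```

In this sketch, a text line is scored by the maximum similarity over its candidate patches, which is the standard MIL assumption the paper starts from before refining it with RankMIL and the DPMA.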