Traditional semantic image search methods aim to retrieve images that match the meaning of a text query. However, these methods typically match the query against the whole image, without considering where objects are located within it. This paper presents an extension of existing object detection models for semantic image search that considers the semantic alignment between object proposals and text queries, with a focus on searching for objects within images. The proposed model uses a single feature extractor, a pre-trained Convolutional Neural Network (CNN), and a transformer encoder that encodes the text query. A Region Proposal Network (RPN) generates object proposals, and proposal-text alignment is learned contrastively, producing a score for each proposal that reflects its semantic alignment with the text query. The model is trained end-to-end and offers a promising solution for semantic image search: it retrieves images that match the meaning of the text query and generates semantically relevant object proposals.
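The proposal-text alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss or scoring function, so everything here is an assumption made for illustration: cosine similarity as the alignment score, an InfoNCE-style loss with in-batch negatives, a temperature of 0.07, and the hypothetical function names `alignment_scores` and `contrastive_loss`. Embeddings are plain Python lists standing in for the outputs of the CNN/RPN branch and the transformer text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def alignment_scores(proposal_embs, text_emb):
    """Cosine similarity between each proposal embedding and one query embedding.

    Assumption: cosine similarity is used as the alignment score; the
    abstract only says that each proposal receives a score.
    """
    t = l2_normalize(text_emb)
    return [
        sum(a * b for a, b in zip(l2_normalize(p), t))
        for p in proposal_embs
    ]


def contrastive_loss(proposal_embs, text_embs, temperature=0.07):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    Proposal i is treated as the positive match for text i; every other
    text in the batch acts as a negative.
    """
    losses = []
    for i, p in enumerate(proposal_embs):
        # Similarity of proposal i to every text, scaled by the temperature.
        logits = [s / temperature for s in alignment_scores(text_embs, p)]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching text.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy example: three 2-D proposal embeddings scored against one query.
proposals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
scores = alignment_scores(proposals, query)
best = max(range(len(scores)), key=lambda i: scores[i])
```

In this toy example the first proposal points in the same direction as the query and therefore receives the highest score. At retrieval time, under the same assumptions, an image could be ranked by the best score among its proposals, which also identifies the region most relevant to the query.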