The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images are often complex, containing multiple objects and intricate backgrounds. Users often want to retrieve images containing a specific object, a task we define as Focus-Oriented Image Retrieval (FOIR). While a standard image encoder can extract image features for similarity matching, it may perform suboptimally in the multi-object FOIR task because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose Prompt-guided attention Head Selection (PHS), an approach that leverages the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps against the user's visual prompt, such as a point, box, or segmentation mask. This enables the model to focus on the specific object of interest while preserving the surrounding visual context. Notably, PHS requires no model re-training and alters no images. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical, training-free solution to enhance model performance in the FOIR task.
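The prompt-to-head matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `select_heads`, the attention-mass overlap score, and the toy tensor shapes are all hypothetical, and real ViT attention maps would come from the model's per-head CLS-to-patch attention.

```python
import numpy as np

def select_heads(attn_maps, prompt_mask, k=3):
    """Rank attention heads by how much of their attention mass falls
    inside the user's prompt region, and keep the top-k heads.

    attn_maps:   (num_heads, H, W) per-head attention over the patch grid
    prompt_mask: (H, W) binary mask derived from a point, box, or segmentation
    """
    mass_in = (attn_maps * prompt_mask).sum(axis=(1, 2))      # attention inside prompt
    mass_total = attn_maps.sum(axis=(1, 2)) + 1e-8            # total attention per head
    scores = mass_in / mass_total                             # fraction inside the prompt
    return np.argsort(scores)[::-1][:k]                       # best-matching head indices

# Toy example: 4 heads over a 4x4 patch grid, prompt covering the top-left 2x2 block.
rng = np.random.default_rng(0)
attn = rng.random((4, 4, 4))
attn /= attn.sum(axis=(1, 2), keepdims=True)  # normalize like softmax attention
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
heads = select_heads(attn, mask, k=2)
```

In a full pipeline, the features associated with the selected heads would then be used for similarity matching, which is what lets the model emphasize the prompted object without re-training or modifying the input image.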