Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of VLMs, they still lag significantly behind human performance on IRCD. The main challenge lies in aligning key contextual cues across the two modalities: these subtle cues are concealed in tiny regions of multiple contrastive images and within the complex linguistics of the textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method for challenging IRCD that relies on a doubly contextual alignment scheme. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues, and the two losses provide iterative supervision for the adapter, gradually aligning the focal patches of a single image with the key textual cues. We term this intra-contextual alignment. 2) ContextBLIP then employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text and multiple images. We term this step inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP yields results comparable with GPT-4V despite using about 7,500 times fewer parameters.
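The doubly contextual alignment idea can be illustrated with a minimal sketch: score each candidate against the text in isolation (intra-contextual), then let the candidates attend to one another before rescoring so that the final scores reflect what distinguishes each image from the contrastive set (inter-contextual). This is only an illustrative simplification assuming generic embedding vectors; ContextBLIP's actual multi-scale adapter, masking loss, and inter-context encoder are learned components, not the cosine-similarity and single attention step shown here.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def doubly_contextual_scores(text_emb, img_embs):
    """Toy stand-in for ContextBLIP's two alignment stages.

    text_emb: (d,) text embedding; img_embs: (n, d) candidate embeddings.
    Returns a probability distribution over the n candidates.
    """
    # Intra-contextual alignment: match each image to the text on its own.
    intra = np.array([cosine(text_emb, v) for v in img_embs])

    # Inter-contextual alignment: a single self-attention step among the
    # candidates contextualizes each image embedding by the others,
    # mimicking the role of the inter-context encoder.
    sims = img_embs @ img_embs.T / np.sqrt(img_embs.shape[1])
    attn = np.stack([softmax(row) for row in sims])
    contextualized = attn @ img_embs

    # Rescore the contextualized embeddings and combine both stages.
    inter = np.array([cosine(text_emb, v) for v in contextualized])
    return softmax(intra + inter)
```

In practice the intra stage corresponds to the adapter supervised by the matching and text-guided masking losses, while the inter stage corresponds to the learned encoder over the whole candidate set; the toy version only shows how the two scores compose into a final ranking.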