Recent advances in vision-language models (VLMs) have brought significant progress on downstream tasks that require quantitative concepts, such as facial age estimation and image quality assessment, enabling VLMs to explore applications like image ranking and retrieval. However, existing studies typically focus on reasoning over a single image and depend heavily on text prompting, limiting their ability to learn a comprehensive understanding from multiple images. To address this, we propose an effective yet efficient approach that reframes CLIP as a learning-to-rank model and introduces a lightweight adapter to augment CLIP for text-guided image ranking. Specifically, our approach incorporates learnable prompts that adapt to new ranking instructions, and an auxiliary branch with ranking-aware attention that leverages text-conditioned visual differences as additional supervision for image ranking. Our ranking-aware adapter consistently outperforms fine-tuned CLIP variants on various tasks and achieves competitive results against state-of-the-art models designed for specific tasks such as facial age estimation and image quality assessment. Overall, our approach ranks images with a single instruction, providing a natural and generalizable way to learn from visual differences across images while bypassing the need for extensive text prompts tailored to individual tasks. Code is available at https://github.com/uynaes/RankingAwareCLIP.
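To make the text-guided ranking idea concrete, below is a minimal, hypothetical sketch using only the Python standard library. It assumes image and instruction embeddings have already been computed (e.g., by a CLIP encoder); the function names and the pairwise hinge loss are illustrative stand-ins for a learning-to-rank objective, not the paper's actual adapter or API.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def rank_images(image_embs, text_emb):
    # score each image against the instruction embedding, then return
    # image indices sorted from highest to lowest score
    scores = [cosine(e, text_emb) for e in image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

def pairwise_hinge_loss(score_hi, score_lo, margin=0.1):
    # illustrative learning-to-rank loss: penalize when the image that
    # should rank higher does not beat the lower one by at least `margin`
    return max(0.0, margin - (score_hi - score_lo))

# toy 2-D embeddings standing in for CLIP features
text_emb = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
order, scores = rank_images(images, text_emb)
# order is [0, 2, 1]: image 0 matches the instruction best
```

In the actual approach, the scores would come from the ranking-aware adapter on top of CLIP rather than raw cosine similarity, and the supervision would exploit text-conditioned differences between image pairs; the pairwise loss above only sketches how ranking order can be supervised.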