Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often hampered by short, underspecified user queries. Such queries are typically only one or two words long, which makes them semantically ambiguous, prone to collisions across diverse visual interpretations, and devoid of any explicit control over the quality of the retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to use a generative language model as a query completion function that extends underspecified queries into descriptive forms capturing fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels derived from relevance and aesthetic scoring models, so that query enrichment is both semantically meaningful and quality-aware. The resulting system offers three key advantages: 1) flexibility: it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency: the enriched queries are explicitly interpretable by users; and 3) controllability: retrieval results can be steered toward user-preferred quality levels. Extensive experiments demonstrate that the proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.
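
To make the idea of conditioning query completion on discretized quality levels concrete, the following is a minimal sketch and not the released implementation. It assumes that continuous relevance and aesthetic scores are bucketed by fixed thresholds and that the resulting levels are passed to the language model as plain-text control tags. The threshold values, the tag format, and the names `discretize` and `build_prompt` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical bucket boundaries; real values would depend on the
# scoring models and the score distribution of the training data.
RELEVANCE_BOUNDS = [0.2, 0.3]   # yields relevance levels 0, 1, 2
AESTHETIC_BOUNDS = [4.5, 6.0]   # yields aesthetic levels 0, 1, 2


def discretize(score: float, boundaries: Sequence[float]) -> int:
    """Map a continuous score to a level index in 0..len(boundaries).

    `boundaries` must be sorted in ascending order.
    """
    return bisect_right(boundaries, score)


def build_prompt(query: str, relevance_level: int, aesthetic_level: int) -> str:
    """Prefix a short query with quality-level tags (assumed format)."""
    return (
        f"<relevance={relevance_level}> <aesthetic={aesthetic_level}> "
        f"Query: {query}\nCompleted query:"
    )


# Example: condition the completion of the short query "dog" on the
# quality levels of a reference image with the scores below.
rel = discretize(0.31, RELEVANCE_BOUNDS)
aes = discretize(5.2, AESTHETIC_BOUNDS)
prompt = build_prompt("dog", rel, aes)
print(prompt)
```

Under these assumptions, a user could request higher-quality results at retrieval time by choosing larger level values when building the prompt; the completed query returned by the language model would then be encoded by an unmodified pretrained VLM text encoder.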
