The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods rely heavily on proximity measurements such as tag matching or image similarity, and therefore require precise user inputs to return satisfactory results. To meet the growing demand for a contemporary image search engine that accurately comprehends users' search intentions, we introduce a user intent expansion framework. Our framework leverages vision-language models to parse and compose multi-modal user inputs and thereby provide more accurate and satisfying results. It comprises two stages: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex search scenarios. Moreover, the intent expansion framework enables users to perform flexible, contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework in an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. In particular, the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and to engage in a more enjoyable iterative search experience.

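To make the logic composition stage more concrete, the following is a minimal sketch of how several parsed intents could be combined into one logic expression and evaluated over a gallery. It is not the authors' implementation: the `Intent` class, the `has` and `search` helpers, the label vocabulary, and the `GALLERY` data are all hypothetical, and the sketch assumes the parsing stage has already reduced each image to a set of text labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical gallery: image id -> labels assumed to come from a parsing stage.
Gallery = Dict[str, Set[str]]


@dataclass(frozen=True)
class Intent:
    """A composable search intent wrapping a predicate over an image's labels."""

    matches: Callable[[Set[str]], bool]

    def __and__(self, other: "Intent") -> "Intent":
        # Both intents must hold.
        return Intent(lambda labels: self.matches(labels) and other.matches(labels))

    def __or__(self, other: "Intent") -> "Intent":
        # At least one intent must hold.
        return Intent(lambda labels: self.matches(labels) or other.matches(labels))

    def __invert__(self) -> "Intent":
        # The intent must not hold.
        return Intent(lambda labels: not self.matches(labels))


def has(label: str) -> Intent:
    """Atomic intent: the image carries the given label."""
    return Intent(lambda labels: label in labels)


def search(gallery: Gallery, intent: Intent) -> List[str]:
    """Return the sorted ids of all images that satisfy the composed intent."""
    return sorted(img for img, labels in gallery.items() if intent.matches(labels))


# Illustrative data only; not taken from the paper.
GALLERY: Gallery = {
    "nft_001": {"ape", "hat", "blue background"},
    "nft_002": {"ape", "sunglasses", "red background"},
    "nft_003": {"cat", "hat", "blue background"},
    "nft_004": {"ape", "hat", "red background"},
}

# "An ape wearing a hat or sunglasses, but not on a red background."
query = has("ape") & (has("hat") | has("sunglasses")) & ~has("red background")
print(search(GALLERY, query))  # ['nft_001']

# Iterative refinement: widen the previous query with a second intent.
refined = query | (has("cat") & has("hat"))
print(search(GALLERY, refined))  # ['nft_001', 'nft_003']
```

Representing each intent as a predicate keeps composition closed under AND, OR, and NOT, so a refinement step can reuse the previous expression instead of rebuilding the query. A real system would presumably score images with a vision-language model rather than test exact label membership.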