Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desirable in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture this diversity. In this work, we aim to develop a unified framework capable of handling diverse, realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building on this benchmark, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on real-time learning difficulty and a data-scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
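The abstract names spherical linear interpolation as the operation that moves a query representation toward a task-aligned space. As a minimal sketch of that general operation only, the snippet below interpolates between two unit vectors along the arc connecting them. It is not the FashionLens implementation: the function names `normalize` and `slerp`, the `proposal` vector, and the fixed weight `t` are assumptions made for illustration, and the abstract does not say how the proposal or the adaptive weight is actually computed.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(q: Sequence[float], p: Sequence[float], t: float,
          eps: float = 1e-7) -> List[float]:
    """Spherical linear interpolation from q (t=0) to p (t=1).

    Both inputs are normalized first, so the result lies on the unit sphere.
    """
    q = normalize(q)
    p = normalize(p)
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, p))))
    omega = math.acos(dot)          # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: fall back to linear interpolation and
        # renormalize. Antipodal inputs have no unique arc and are not
        # handled specially in this sketch.
        return normalize([(1.0 - t) * a + t * b for a, b in zip(q, p)])
    w_q = math.sin((1.0 - t) * omega) / sin_omega
    w_p = math.sin(t * omega) / sin_omega
    return [w_q * a + w_p * b for a, b in zip(q, p)]


# Hypothetical usage: shift a query embedding halfway toward a proposal.
query = [1.0, 0.0]
proposal = [0.0, 1.0]   # assumed task-specific target direction
calibrated = slerp(query, proposal, t=0.5)
print(calibrated)
```

In an adaptive variant, `t` would presumably be predicted per query or per task rather than fixed. Interpolating along the arc rather than along a straight line keeps the result at unit norm, which matters when retrieval scores are cosine similarities.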