In this paper, we present LookBench (we use the term "look" to reflect retrieval that mirrors how people shop: finding the exact item, a close substitute, or a visually consistent alternative), a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped, and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers both single-item and outfit-level retrieval. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second; both models attain state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
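To make the headline metric concrete, the following is a minimal sketch of Recall@K combined with a timestamp filter, under stated assumptions. The abstract does not specify the evaluation protocol, so the function names (`recall_at_k`, `contamination_aware_recall`), the use of cosine similarity over embeddings, and the choice to filter queries (rather than the gallery) by date are all hypothetical and are not taken from the released evaluation code.

```python
from datetime import date

import numpy as np


def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1):
    """Fraction of queries with at least one matching label in the top-k results."""
    query_emb = np.asarray(query_emb, dtype=float)
    gallery_emb = np.asarray(gallery_emb, dtype=float)
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)

    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T

    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


def contamination_aware_recall(query_emb, gallery_emb, query_labels,
                               gallery_labels, query_dates, training_cutoff, k=1):
    """Recall@K restricted to queries dated strictly after the training cutoff.

    Assumption: a model's declared cutoff is a single date, and only
    queries newer than it count as uncontaminated.
    """
    keep = np.array([d > training_cutoff for d in query_dates])
    if not keep.any():
        raise ValueError("no test samples are newer than the training cutoff")
    return recall_at_k(np.asarray(query_emb, dtype=float)[keep], gallery_emb,
                       np.asarray(query_labels)[keep], gallery_labels, k=k)


# Toy data (made up for illustration): three gallery items, four queries.
gallery = np.eye(3)
gallery_ids = [0, 1, 2]
queries = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.9, 0.1, 0]]
query_ids = [0, 1, 2, 1]  # the last query's nearest neighbour is the wrong item
dates = [date(2024, 1, 1), date(2024, 6, 1), date(2025, 1, 1), date(2025, 2, 1)]

overall = recall_at_k(queries, gallery, query_ids, gallery_ids, k=1)
fresh = contamination_aware_recall(queries, gallery, query_ids, gallery_ids,
                                   dates, training_cutoff=date(2024, 12, 31), k=1)
```

In this toy setup, restricting evaluation to post-cutoff samples changes the score, which is the kind of effect a time-stamped benchmark is meant to expose.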