Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. The textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.
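To make the multi-target retrieval setting concrete, the following is a minimal sketch of how a ranked result list could be scored when one query has several valid targets. The abstract does not specify the benchmark's actual metrics, so the function names (`recall_at_k`, `hit_at_k`, `mean_metric`) and the choice of these two scores are assumptions made for illustration, not the VietFashion evaluation protocol.

```python
from typing import Callable, Sequence, Set


def recall_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """Fraction of a query's valid targets found in the top-k results."""
    if not targets:
        raise ValueError("each query needs at least one valid target")
    return len(set(ranked[:k]) & targets) / len(targets)


def hit_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """1.0 if any valid target appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def mean_metric(
    metric: Callable[[Sequence[str], Set[str], int], float],
    rankings: Sequence[Sequence[str]],
    target_sets: Sequence[Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over all queries."""
    if len(rankings) != len(target_sets):
        raise ValueError("rankings and target_sets must have the same length")
    scores = [metric(r, t, k) for r, t in zip(rankings, target_sets)]
    return sum(scores) / len(scores)


# Hypothetical image IDs: two queries, the first with two valid targets.
rankings = [["a", "b", "c", "d"], ["x", "y", "z"]]
target_sets = [{"b", "d"}, {"z"}]

print(mean_metric(recall_at_k, rankings, target_sets, k=2))
print(mean_metric(hit_at_k, rankings, target_sets, k=2))
```

The two scores answer different questions: `hit_at_k` rewards retrieving any acceptable design, while `recall_at_k` rewards covering all acceptable designs, which matters more when a sketch-text query is intentionally ambiguous.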