Fine-grained fashion retrieval searches for items that share a similar attribute with the query image. Most existing methods use a pre-trained feature extractor (e.g., ResNet-50) to capture image representations. However, a pre-trained backbone is typically trained for image classification or object detection, tasks fundamentally different from fine-grained fashion retrieval; existing methods therefore suffer from a feature gap problem when directly fine-tuning the pre-trained backbone. To solve this problem, we introduce an attribute-guided multi-level attention network (AG-MAN). Specifically, we first enhance the pre-trained feature extractor to capture multi-level image embeddings, thereby enriching the low-level features within these representations. Then, we propose a classification scheme in which images with the same attribute, albeit with different values, are assigned to the same class; this further alleviates the feature gap problem by perturbing object-centric feature learning. Moreover, we propose an improved attribute-guided attention module for extracting more accurate attribute-specific representations. Our model consistently outperforms existing attention-based methods on the FashionAI (62.8788% MAP), DeepFashion (8.9804% MAP), and Zappos50k (93.32% prediction accuracy) datasets. In particular, it improves on the representative ASENet_V2 model by 2.12, 0.31, and 0.78 percentage points on FashionAI, DeepFashion, and Zappos50k, respectively. The source code is available at https://github.com/Dr-LingXiao/AG-MAN.
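To give a concrete sense of the attribute-guided attention idea, the following is a minimal NumPy sketch of attribute-guided spatial pooling: an attribute embedding scores each spatial location of a backbone feature map, and the softmax-normalized scores weight the pooled representation. This is an illustrative simplification, not the paper's exact AG-MAN module; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attribute_guided_attention(feature_map, attr_embed):
    """Pool a (C, H, W) feature map into an attribute-specific vector.

    feature_map: (C, H, W) backbone features for one image.
    attr_embed:  (C,) embedding of the query attribute.
    Returns a (C,) attribute-specific representation: a spatial
    average of the features, weighted by their relevance to the
    attribute.
    """
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)   # (C, HW)
    scores = attr_embed @ flat             # (HW,) relevance per location
    weights = softmax(scores)              # spatial attention weights
    return flat @ weights                  # (C,) attention-weighted pooling
```

With a uniform feature map every location receives equal weight, so the result equals ordinary average pooling; an attribute whose embedding aligns with features at particular locations shifts the pooled vector toward those regions.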