Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval

Our work tackles large-scale fine-grained image retrieval, which ranks images depicting the concept of interest (i.e., those with the same sub-category label) highest based on the fine-grained details in the query. Such a practical task faces two challenges: the fine-grained nature of the data, with small inter-class variations and large intra-class variations, and the explosive growth of fine-grained data. In this paper, we propose attribute-aware hashing networks with self-consistency that generate attribute-aware hash codes, which not only make the retrieval process efficient but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the visual representations captured by attention, we develop an encoder-decoder network trained on a reconstruction task to distill high-level attribute-specific vectors from the appearance-specific visual representations in an unsupervised manner, without attribute annotations. Our models are also equipped with a feature decorrelation constraint on these attribute vectors to strengthen their representative abilities. Then, driven by preserving the similarity of the original entities, the required hash codes are generated from these attribute-specific vectors and thus become attribute-aware. Furthermore, to combat simplicity bias in deep hashing, we consider the model design from the perspective of the self-consistency principle and further enhance the models' self-consistency by adding an image reconstruction path. Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of our models over competing methods.
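To make the pipeline in the abstract concrete, the following is a minimal NumPy sketch of three of its ingredients: a decorrelation penalty on attribute vectors, binarization into hash codes, and Hamming-distance ranking. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact form of the decorrelation constraint or the hashing layer, so the squared off-diagonal cosine penalty, the one-bit-per-attribute mapping, and the names `decorrelation_penalty`, `to_hash_code`, and `hamming_rank` are all hypothetical.

```python
import numpy as np


def decorrelation_penalty(attr_vectors):
    """Sum of squared off-diagonal cosine similarities between attribute vectors.

    attr_vectors has shape (k, d): k attribute vectors of dimension d.
    Assumption: a generic decorrelation penalty, not the paper's exact loss.
    """
    norms = np.linalg.norm(attr_vectors, axis=1, keepdims=True)
    unit = attr_vectors / np.maximum(norms, 1e-12)  # guard against zero vectors
    gram = unit @ unit.T                            # (k, k) cosine similarities
    off_diag = gram - np.diag(np.diag(gram))        # zero out the diagonal
    return float(np.sum(off_diag ** 2))


def to_hash_code(attr_vectors, projections):
    """Map each attribute vector to one bit in {-1, +1}.

    Assumption: bit i is the sign of the dot product between attribute
    vector i and its own projection vector (both of shape (k, d)).
    """
    scores = np.einsum("kd,kd->k", attr_vectors, projections)
    return np.where(scores >= 0, 1, -1).astype(np.int8)


def hamming_rank(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query code."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances, kind="stable")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attrs = rng.normal(size=(8, 16))   # 8 hypothetical attribute vectors
    projs = rng.normal(size=(8, 16))   # hypothetical per-attribute projections
    code = to_hash_code(attrs, projs)
    print("penalty:", decorrelation_penalty(attrs))
    print("code:", code)
```

In this sketch the penalty is zero exactly when the attribute vectors are mutually orthogonal, which is one simple way to encourage each vector, and hence each bit, to capture a distinct attribute. Retrieval then reduces to comparing short binary codes, which is what makes hashing-based search efficient at scale.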
