Many vision studies have aimed to build effective embedding spaces for predicting a single label per object in an image. In reality, however, most objects have multiple specific attributes, such as shape, color, and length, and each attribute comprises several classes. To be applied in real-world scenarios, a model must be able to distinguish the fine-grained components of an object. Conventional approaches that embed multiple specific attributes in a single network often suffer from entanglement, in which the fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes using only a single backbone. First, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific attributes), and we demonstrate its effectiveness through diverse visualization examples. Second, we apply the vision transformer to a fine-grained image retrieval task for the first time and present a framework that is simple yet effective compared with existing methods. Unlike previous studies, whose performance varied with the benchmark dataset, our proposed method achieves consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
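To make the idea of switching conditions over a single backbone concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that the condition (attribute) embedding acts as the attention query while the patch tokens of a shared backbone supply keys and values; the abstract does not specify this arrangement. All names (`conditional_cross_attention`, `condition_queries`, `w_k`, `w_v`) and the toy dimensions are hypothetical, and random numbers stand in for learned weights and real image features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def conditional_cross_attention(patch_tokens, condition_query, w_k, w_v):
    """Single-head cross-attention (assumed layout, not from the paper):
    the query is the condition embedding, keys/values come from patch tokens."""
    keys = [matvec(w_k, token) for token in patch_tokens]
    values = [matvec(w_v, token) for token in patch_tokens]
    scale = math.sqrt(len(condition_query))
    scores = [
        sum(q * k for q, k in zip(condition_query, key)) / scale for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # Attribute-specific embedding: attention-weighted sum of the value vectors.
    embedding = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return embedding, weights


rng = random.Random(0)
DIM, NUM_PATCHES = 4, 6


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the output of one shared backbone (hypothetical toy features).
patch_tokens = random_matrix(NUM_PATCHES, DIM)
w_k = random_matrix(DIM, DIM)
w_v = random_matrix(DIM, DIM)

# One query vector per attribute; in a trained model these would be learned.
condition_queries = {
    "color": [rng.gauss(0.0, 1.0) for _ in range(DIM)],
    "shape": [rng.gauss(0.0, 1.0) for _ in range(DIM)],
}

results = {
    name: conditional_cross_attention(patch_tokens, query, w_k, w_v)
    for name, query in condition_queries.items()
}
for name, (embedding, weights) in results.items():
    print(name, [round(x, 3) for x in embedding])
```

In this sketch the same patch tokens and projection matrices are reused for every attribute, and only the query changes. Swapping the condition therefore yields a different attention pattern and a different embedding for the same image, which is one plausible way to read "multi-space embeddings with only a single backbone".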