Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network

Many studies in vision tasks have aimed to create effective embedding spaces for single-label object prediction within an image. In practice, however, most objects possess multiple specific attributes, such as shape, color, and length, and each attribute comprises several classes. Applying models in real-world scenarios therefore requires distinguishing the granular components of an object. Conventional approaches that embed multiple specific attributes in a single network often suffer from entanglement, in which the fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone. First, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific attributes), and we demonstrate its effectiveness through diverse visualization examples. Second, we apply the vision transformer to a fine-grained image retrieval task for the first time and present a framework that is simple yet effective compared to existing methods. Unlike previous studies, whose performance varied depending on the benchmark dataset, our proposed method achieves consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
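To make the idea of "switching" conditions concrete, the following is a minimal sketch of single-head cross-attention in which a per-attribute condition vector acts as the query and the image patch tokens act as keys and values. It is not the authors' implementation: the function name `conditional_cross_attention`, the random stand-in tokens, and the choice of the condition as the query are all assumptions made for illustration. Learned query/key/value projections and the backbone itself are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conditional_cross_attention(condition_query, patch_tokens):
    """Single-head cross-attention (hypothetical helper).

    The condition vector is the query; the patch tokens serve as both
    keys and values. Returns the attended embedding and the attention
    weights over patches.
    """
    dim = len(condition_query)
    scale = math.sqrt(dim)
    weights = softmax([dot(condition_query, tok) / scale for tok in patch_tokens])
    embedding = [
        sum(w * tok[i] for w, tok in zip(weights, patch_tokens))
        for i in range(dim)
    ]
    return embedding, weights


rng = random.Random(0)
DIM, NUM_PATCHES = 8, 5

# Assumption: random stand-ins for the patch tokens a shared backbone
# would produce for one image.
patch_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# Assumption: one vector per attribute, standing in for learned
# condition embeddings.
condition_queries = {
    name: [rng.gauss(0, 1) for _ in range(DIM)]
    for name in ("shape", "color", "length")
}

# Same image tokens, different condition: each attribute gets its own embedding.
embeddings = {}
for name, query in condition_queries.items():
    embeddings[name], attn = conditional_cross_attention(query, patch_tokens)
    print(name, [round(w, 3) for w in attn])
```

Because only the query changes between attributes, the same patch tokens are reweighted differently for each condition, which is one simple way a single backbone can yield a separate embedding per attribute.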

