Compositional zero-shot learning (CZSL) aims to recognize unseen compositions using prior knowledge of known primitives (attributes and objects). Previous CZSL methods often struggle with three issues: capturing the contextuality between attribute and object, learning discriminative visual features, and handling the long-tailed distribution of real-world compositional data. We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. CoT employs object and attribute experts that generate representative embeddings in distinct ways, using the visual network hierarchically. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert generates attribute embeddings in a top-down manner through a proposed object-guided attention module that explicitly models contextuality. To remedy the biased predictions caused by imbalanced data distribution, we develop a simple minority attribute augmentation (MAA) that synthesizes virtual samples by mixing two images and oversampling minority attribute classes. Our method achieves state-of-the-art performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the effectiveness of CoT in improving visual discrimination and addressing the model bias caused by the imbalanced data distribution. The code is available at https://github.com/HanjaeKim98/CoT.
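The abstract names an object-guided attention module but does not give its form. As a rough illustration only, the sketch below assumes scaled dot-product attention in which the object embedding acts as the query over patch features from an intermediate layer. The function name `object_guided_attention`, the projection matrices `w_q` and `w_k`, and all shapes are hypothetical and are not taken from the CoT code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def object_guided_attention(obj_emb, patch_feats, w_q, w_k):
    """Pool patch features into an attribute embedding, guided by the object.

    Assumed shapes (hypothetical, for illustration):
      obj_emb:     (d,)     object embedding from the final layer
      patch_feats: (n, d)   patch features from an intermediate layer
      w_q, w_k:    (d, d_k) query and key projections
    """
    q = obj_emb @ w_q                        # (d_k,) query from the object
    k = patch_feats @ w_k                    # (n, d_k) one key per patch
    scores = k @ q / np.sqrt(q.shape[-1])    # (n,) scaled dot products
    attn = softmax(scores)                   # (n,) weights that sum to 1
    attr_emb = attn @ patch_feats            # (d,) attention-weighted pooling
    return attr_emb, attn


# Toy usage with random inputs.
rng = np.random.default_rng(0)
d, d_k, n = 8, 4, 5
attr_emb, attn = object_guided_attention(
    rng.normal(size=d),
    rng.normal(size=(n, d)),
    rng.normal(size=(d, d_k)),
    rng.normal(size=(d, d_k)),
)
```

In this sketch the attribute embedding depends on the object through the attention weights, which is one simple way to make the attribute representation conditional on the object it modifies.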
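The idea behind MAA, mixing two images while oversampling minority attribute classes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes inverse-frequency sampling of the mixing partner and a Mixup-style blend with a Beta-distributed coefficient. The names `minority_sampling_probs` and `mix_with_minority` and the parameter `alpha` are hypothetical.

```python
import numpy as np


def minority_sampling_probs(attr_labels, num_attrs):
    """Per-sample probabilities proportional to inverse attribute frequency."""
    counts = np.bincount(attr_labels, minlength=num_attrs).astype(float)
    inv_freq = np.zeros(num_attrs)
    inv_freq[counts > 0] = 1.0 / counts[counts > 0]
    weights = inv_freq[attr_labels]          # rare attributes get larger weight
    return weights / weights.sum()


def mix_with_minority(images, attr_labels, num_attrs, alpha=1.0, rng=None):
    """Blend each image with a partner drawn in favor of minority attributes.

    Assumption: Mixup-style convex combination of images and of one-hot
    attribute labels, with lam ~ Beta(alpha, alpha) drawn per sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(images)
    probs = minority_sampling_probs(attr_labels, num_attrs)
    partner = rng.choice(n, size=n, p=probs)             # oversample minorities
    lam = rng.beta(alpha, alpha, size=n)                 # mixing coefficients
    lam_img = lam.reshape(n, *([1] * (images.ndim - 1)))  # broadcast over pixels
    mixed = lam_img * images + (1.0 - lam_img) * images[partner]
    one_hot = np.eye(num_attrs)[attr_labels]
    soft = lam[:, None] * one_hot + (1.0 - lam[:, None]) * one_hot[partner]
    return mixed, soft


# Toy usage: four 2x2 single-channel "images", attribute 1 is the minority.
images = np.arange(16, dtype=float).reshape(4, 1, 2, 2)
labels = np.array([0, 0, 0, 1])
mixed, soft = mix_with_minority(images, labels, num_attrs=2,
                                rng=np.random.default_rng(0))
```

With these toy labels, each of the three majority samples is drawn as a partner with probability 1/6 and the single minority sample with probability 1/2, so both attribute classes are equally likely to contribute to a virtual sample.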