Compositionality, the ability to combine existing concepts and generalize to novel compositions, is a key capability of intelligent entities. We study Compositional Zero-Shot Learning (CZSL), which aims to recognize novel attribute-object compositions. Recent approaches build on large-scale Vision-Language Pre-trained (VLP) models, e.g., CLIP, and report significant improvements. However, these methods treat CLIP as a black box and focus on pre- and post-CLIP operations. We instead propose to modify the architecture itself by inserting adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features for "object", "attribute", and "composition" can be extracted. We name our method CAILA, Concept-Aware Intra-Layer Adapters. Quantitative evaluations on three popular CZSL datasets, MIT-States, C-GQA, and UT-Zappos, show that CAILA achieves double-digit relative improvements over the current state of the art on all benchmarks.
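To make the idea of intra-layer, concept-aware adapters concrete, the following is a minimal, dependency-free sketch and not the CAILA implementation. It assumes a standard bottleneck adapter (down-projection, ReLU, up-projection, residual connection) with a zero-initialized up-projection, and it keeps one adapter per concept after each frozen layer. The names `BottleneckAdapter`, `ConceptAwareAdapters`, `frozen_layer`, and `encode` are hypothetical, `frozen_layer` is only a stand-in for a real CLIP encoder layer, and the abstract does not specify how the concept-specific features are combined, so that step is left out.

```python
import random

CONCEPTS = ("attribute", "object", "composition")


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Assumed adapter form: down-project, ReLU, up-project, add the input back."""

    def __init__(self, dim, bottleneck, rng):
        # Small random down-projection of shape (bottleneck x dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Zero-initialized up-projection of shape (dim x bottleneck), so an
        # untrained adapter leaves its input unchanged (an assumption here).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, hidden):
        z = [max(0.0, v) for v in matvec(self.down, hidden)]
        delta = matvec(self.up, z)
        return [h + d for h, d in zip(hidden, delta)]


class ConceptAwareAdapters:
    """One adapter per concept, attached to a single encoder layer."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.adapters = {c: BottleneckAdapter(dim, bottleneck, rng)
                         for c in CONCEPTS}

    def adapt(self, concept, hidden):
        return self.adapters[concept](hidden)


def frozen_layer(hidden):
    """Hypothetical stand-in for a frozen CLIP encoder layer (fixed scaling)."""
    return [0.5 * h for h in hidden]


def encode(hidden, layer_adapters):
    """Pass one feature stream per concept through every layer.

    After each frozen layer, the stream for a concept goes through that
    layer's adapter for the same concept.
    """
    streams = {c: list(hidden) for c in CONCEPTS}
    for adapters in layer_adapters:
        for concept in CONCEPTS:
            streams[concept] = adapters.adapt(concept, frozen_layer(streams[concept]))
    return streams


if __name__ == "__main__":
    layers = [ConceptAwareAdapters(dim=4, bottleneck=2, seed=i) for i in range(3)]
    features = encode([1.0, 2.0, 3.0, 4.0], layers)
    for concept, vector in features.items():
        print(concept, vector)
```

In this sketch only the adapter weights would be trained while `frozen_layer` stays fixed, which is what makes the approach parameter-efficient. Because each concept has its own adapter, changing the "attribute" adapter leaves the "object" and "composition" streams untouched.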