In this paper, we study Compositional Zero-Shot Learning (CZSL), the task of recognizing novel attribute-object combinations built from previously seen concepts. Recent work focuses on applying large-scale Vision-Language Pre-trained (VLP) models such as CLIP, which have strong generalization ability. However, these methods treat the pre-trained model as a black box and focus on pre- and post-CLIP operations, so they do not mine the semantic concepts encoded across the layers inside CLIP. We propose to dive deep into the architecture and insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted. We evaluate our method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, and it achieves state-of-the-art performance compared to existing methods on all four.
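To make the idea of concept-aware adapters inside an encoder layer concrete, the following is a minimal, dependency-free sketch and not the implementation used in this work. It assumes a standard residual bottleneck adapter (down-projection, ReLU, up-projection), one adapter per concept, and a toy stand-in for a frozen CLIP layer. The names `BottleneckAdapter`, `ConceptAwareLayer`, and `frozen_layer` are hypothetical and chosen only for illustration.

```python
import random

CONCEPTS = ("attribute", "object", "composition")


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Residual bottleneck adapter: x + W_up @ relu(W_down @ x)."""

    def __init__(self, dim, bottleneck, rng):
        # Small random down-projection (assumed initialization).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero-initialized up-projection, so the adapter starts as identity.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.w_down, x)]
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]


class ConceptAwareLayer:
    """Wraps a frozen layer and applies the adapter of the requested concept."""

    def __init__(self, frozen_layer, dim, bottleneck, rng):
        self.frozen_layer = frozen_layer  # stays fixed; only adapters would train
        self.adapters = {c: BottleneckAdapter(dim, bottleneck, rng)
                         for c in CONCEPTS}

    def __call__(self, x, concept):
        return self.adapters[concept](self.frozen_layer(x))


rng = random.Random(0)
frozen_layer = lambda x: [2.0 * v for v in x]  # toy stand-in, not a real CLIP layer
layer = ConceptAwareLayer(frozen_layer, dim=4, bottleneck=2, rng=rng)

x = [1.0, -2.0, 0.5, 3.0]
features = {c: layer(x, c) for c in CONCEPTS}
```

Because the up-projection starts at zero, every concept branch initially reproduces the frozen layer's output, so under these assumptions only the small adapter weights would need to be trained to separate the three concept-specific features.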