Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. Existing methods predominantly focus on learning semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, thereby limiting their ability to generalize to unseen compositions. To address this challenge, we propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, enabling effective representation learning for compositional tasks. Duplex utilizes a Graph Neural Network (GNN) to adaptively update visual prototypes, capturing complex interactions between states and objects. Additionally, it leverages the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs) and employs a multi-path architecture combined with prompt engineering to align image and text representations, ensuring robust generalization. Extensive experiments on three benchmark datasets demonstrate that Duplex outperforms state-of-the-art methods in both closed-world and open-world settings.