Compositional Zero-Shot Learning (CZSL) aims to recognize novel \textit{state-object} compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, two challenges persist: calibrating the bias between semantically similar multimodal representations, and generalizing pre-trained knowledge to novel compositional contexts. In this paper, we revisit conditional transport (CT) theory and its homology to the visual-semantic interaction in CZSL, and propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that addresses both issues. Concretely, we use three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs whose minimization reduces their semantic discrepancies. To further ensure consistent transfer among these sets, we impose a cycle-consistency constraint that refines learning by requiring features to map back to themselves along the transport flow, regardless of modality. Moreover, we extend the CT plans to the open-world setting, which lets the model filter out infeasible pairs, thereby speeding up inference and improving accuracy. Extensive experiments verify the effectiveness of the proposed method.
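To make the pairwise CT cost and the cycle-consistency idea concrete, the following is a minimal NumPy sketch, not the TsCA implementation. Everything specific in it is an assumption made for illustration: the cosine-based cost, the softmax transport plans with temperature `tau`, uniform weights over set elements, the negative-log-diagonal cycle penalty, and the function names `cosine_cost`, `ct_cost`, and `cycle_consistency`.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cos(x_i, y_j) for rows of x (n, d) and y (m, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ct_cost(x, y, tau=0.1):
    """Bidirectional transport cost between two feature sets.

    Assumed form: each element is transported to the other set with a
    softmax plan over negative cost, and the expected cost is averaged
    uniformly over the source elements in both directions.
    """
    cost = cosine_cost(x, y)                 # (n, m)
    fwd = softmax(-cost / tau, axis=1)       # (n, m), each row sums to 1: x_i -> y
    bwd = softmax(-cost / tau, axis=0).T     # (m, n), each row sums to 1: y_j -> x
    forward_cost = (fwd * cost).sum(axis=1).mean()
    backward_cost = (bwd * cost.T).sum(axis=1).mean()
    return forward_cost + backward_cost, fwd, bwd


def cycle_consistency(fwd, bwd):
    """Penalty that is small when the round trip x -> y -> x returns to the start."""
    round_trip = fwd @ bwd                   # (n, n), row-stochastic
    return -np.log(np.diag(round_trip) + 1e-12).mean()


# Hypothetical feature sets standing in for patches, primitives, and compositions.
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))
primitives = rng.normal(size=(4, 8))
compositions = rng.normal(size=(5, 8))

pairs = [(patches, primitives), (patches, compositions), (primitives, compositions)]
total = 0.0
for a, b in pairs:
    pair_cost, fwd, bwd = ct_cost(a, b)
    total += pair_cost + cycle_consistency(fwd, bwd)
print(float(total))
```

Composing the forward and backward plans yields a row-stochastic matrix over the source set, so penalizing its off-diagonal mass is one simple way to express "self-mapping" consistency; other formulations are equally plausible.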