Generalized Zero-Shot Learning (GZSL) recognizes unseen categories using knowledge transferred from the seen domain, relying on the intrinsic interactions between visual and semantic information. Prior works mainly localize the image regions that correspond to shared attributes. When different visual appearances correspond to the same attribute, the shared attributes inevitably introduce semantic ambiguity, which hampers the exploration of accurate semantic-visual interactions. In this paper, we deploy a dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and improved knowledge transferability. Specifically, DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes adapted to each image, so that an unmatched semantic-visual pair can be recast into a matched one. Then, a semantic-motivated instance decoder strengthens accurate cross-domain interactions between the matched pair for semantic-related instance adaption, encouraging the generation of unambiguous visual representations. Moreover, to mitigate the bias towards seen classes in GZSL, we propose a debiasing loss that pursues response consistency between seen and unseen predictions. PSVMA consistently outperforms other state-of-the-art methods. Code will be available at: https://github.com/ManLiuCoder/PSVMA.
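To make the two-way adaptation described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that both the encoder and the decoder can be approximated by single-head cross-attention without learned projections, and that the debiasing loss can be approximated by matching the mean and variance of seen and unseen prediction scores. All function names, the `n_steps` parameter, and the form of the penalty are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention with a residual link.

    Assumption: no learned query/key/value projections, to keep the
    sketch short. Returns the updated queries and the attention weights.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return queries + weights @ context, weights


def mutual_adaptation_step(prototypes, patches, n_steps=1):
    """Hypothetical stand-in for the progressive semantic-visual adaptation.

    prototypes: (num_attributes, dim) shared attribute prototypes
    patches:    (num_patches, dim) visual features of ONE image
    n_steps:    number of stacked adaptation rounds (illustrative only)
    """
    for _ in range(n_steps):
        # Instance-motivated semantic encoding: prototypes attend to the
        # image's patches and become instance-centric.
        prototypes, _ = cross_attention(prototypes, patches)
        # Semantic-motivated instance decoding: patches attend to the
        # adapted prototypes to obtain semantic-related visual features.
        patches, _ = cross_attention(patches, prototypes)
    return prototypes, patches


def consistency_penalty(seen_scores, unseen_scores):
    """One possible response-consistency penalty (an assumption, not the
    paper's stated formula): squared gaps between the means and between
    the variances of seen and unseen prediction scores."""
    mean_gap = (np.mean(seen_scores) - np.mean(unseen_scores)) ** 2
    var_gap = (np.var(seen_scores) - np.var(unseen_scores)) ** 2
    return mean_gap + var_gap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(5, 8))    # 5 attributes, 8-dim embeddings
    patches = rng.normal(size=(12, 8))  # 12 image patches
    new_protos, new_patches = mutual_adaptation_step(protos, patches, n_steps=2)
    print(new_protos.shape, new_patches.shape)
```

In this sketch the same prototypes yield different adapted prototypes for different images, which is the sense in which they become instance-centric; a real model would add learned projections, multiple heads, and normalization layers.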