Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories. This knowledge, typically encapsulated in attribute descriptions, aids in identifying class-specific visual features, thus facilitating visual-semantic alignment and improving ZSL performance. However, real-world challenges such as distribution imbalances and attribute co-occurrence among instances often hinder the discernment of local variances in images, a problem exacerbated by the scarcity of fine-grained, region-specific attribute annotations. Moreover, the variability in visual presentation within categories can also skew attribute-category associations. In response, we propose a bidirectional cross-modal ZSL approach, CREST. It begins by extracting representations for attribute and visual localization and employs Evidential Deep Learning (EDL) to measure the underlying epistemic uncertainty, thereby enhancing the model's resilience against hard negatives. CREST incorporates dual learning pathways, focusing on both visual-category and attribute-category alignments, to ensure robust correlation between latent and observable spaces. Moreover, we introduce an uncertainty-informed cross-modal fusion technique to refine visual-attribute inference. Extensive experiments on multiple datasets demonstrate our model's effectiveness and unique explainability. Our code and data are available at: https://github.com/JethroJames/CREST.
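To make the EDL component concrete, the following is a minimal sketch of the subjective-logic formulation commonly used in evidential deep learning: non-negative evidence parameterizes a Dirichlet distribution, and the vacuity term quantifies epistemic uncertainty. The activation choice (softplus) and function names here are illustrative assumptions, not CREST's actual implementation.

```python
import numpy as np

def edl_uncertainty(logits):
    """Subjective-logic belief and epistemic uncertainty from raw outputs.

    A non-negative activation (softplus, one common choice) maps logits to
    evidence; alpha = evidence + 1 gives Dirichlet concentration parameters,
    and vacuity u = K / sum(alpha) is high when total evidence is scarce.
    Illustrative sketch only, not the paper's exact formulation.
    """
    evidence = np.log1p(np.exp(logits))           # softplus -> non-negative evidence
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)  # total Dirichlet strength
    belief = evidence / strength                  # per-class belief masses
    u = logits.shape[-1] / strength               # vacuity (epistemic uncertainty)
    return belief, u.squeeze(-1)
```

By construction the belief masses and the vacuity sum to one, so a confidently recognized instance (large evidence for one class) receives low uncertainty, while an ambiguous hard negative keeps most mass in the vacuity term.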