Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning (ZSL). However, annotating them is labor-intensive and requires expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings, enable knowledge transfer between classes, but they do not always reflect visual similarities and therefore yield inferior zero-shot performance. We propose to discover semantic embeddings that contain discriminative visual properties for zero-shot learning, without requiring any human annotation. Our model divides a set of images from seen classes into clusters of local image regions according to their visual similarity, and further encourages these clusters to be class-discriminative and semantically related. To associate these clusters with previously unseen classes, we use external knowledge, e.g., word embeddings, and propose a novel class relation discovery module. Through quantitative and qualitative evaluation, we demonstrate that our model discovers semantic embeddings that model the visual properties of both seen and unseen classes. Furthermore, we demonstrate on three benchmarks that our visually grounded semantic embeddings improve performance over word embeddings by a large margin across various ZSL models.
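The pipeline outlined in the abstract (cluster local image regions, describe each seen class by its clusters, then transfer to unseen classes through word embeddings) can be illustrated with a toy sketch. This is not the proposed model: plain k-means stands in for the learned clustering, a normalized cluster histogram stands in for the class embedding, and a softmax over word-embedding cosine similarities stands in for the class relation discovery module. All function names, the `temperature` parameter, and the random toy data are assumptions made for illustration only.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the learned clustering)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centroids does not modify x.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every region to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment with respect to the last centroid update.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def seen_class_embeddings(assign, region_labels, n_classes, k):
    """Describe each seen class by a normalized histogram of cluster ids."""
    emb = np.zeros((n_classes, k))
    np.add.at(emb, (region_labels, assign), 1.0)
    return emb / np.maximum(emb.sum(axis=1, keepdims=True), 1.0)


def unseen_class_embeddings(w_seen, w_unseen, seen_emb, temperature=0.1):
    """Mix seen-class embeddings, weighted by word-embedding cosine similarity.

    Hypothetical simplification of a class relation module, not the paper's.
    """
    ws = w_seen / np.linalg.norm(w_seen, axis=1, keepdims=True)
    wu = w_unseen / np.linalg.norm(w_unseen, axis=1, keepdims=True)
    logits = (wu @ ws.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ seen_emb


# Toy data: random vectors in place of real region features and word vectors.
rng = np.random.default_rng(0)
regions = rng.normal(size=(200, 16))       # local region features
region_labels = rng.integers(0, 4, 200)    # seen-class id of each region
w_seen = rng.normal(size=(4, 8))           # word embeddings, 4 seen classes
w_unseen = rng.normal(size=(2, 8))         # word embeddings, 2 unseen classes

_, assign = kmeans(regions, k=5)
seen_emb = seen_class_embeddings(assign, region_labels, n_classes=4, k=5)
unseen_emb = unseen_class_embeddings(w_seen, w_unseen, seen_emb)
print(unseen_emb.shape)  # (2, 5)
```

Each unseen-class row is a convex combination of seen-class rows, so the unseen embeddings live in the same cluster space as the seen ones. That shared space is what lets a downstream ZSL model, trained only on seen classes, compare images against unseen classes.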