An Integral Projection-based Semantic Autoencoder for Zero-Shot Learning

Zero-shot learning (ZSL) classification predicts classes (labels) that are absent from the training set (unseen classes). Recent works have proposed various semantic autoencoder (SAE) models in which the encoder embeds the visual feature space into the semantic space and the decoder reconstructs the original visual feature space. The objective is to learn, from a source data distribution, an embedding that transfers effectively to a different but related target data distribution. Such embedding-based methods are prone to the domain shift problem and are vulnerable to biases. We propose an integral projection-based semantic autoencoder (IP-SAE) in which the encoder projects the visual feature space, concatenated with the semantic space, into a latent representation space, and the decoder is forced to reconstruct the visual-semantic data space. Under this constraint, the visual-semantic projection function preserves the discriminative information contained in the original visual feature space. The enriched projection enforces a more precise reconstruction of the visual feature space that is invariant to the domain manifold. Consequently, the learned projection function is less domain-specific and alleviates the domain shift problem. The proposed IP-SAE model uses a single symmetric transformation function for both embedding and projection, which makes generative applications in ZSL more transparent to interpret. In addition to outperforming state-of-the-art methods on four benchmark datasets, our analytical approach allows us to investigate distinct characteristics of generative-based methods in the unique context of zero-shot inference.
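The encoder-decoder structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a linear, tied-weight autoencoder (encoder W, decoder W^T) fitted with an SAE-style objective, min_W ||Y - W^T Z||^2 + lam * ||W Y - Z||^2, where Y stacks the visual features X on top of the semantic vectors S. The latent targets Z, the weight lam, the function name fit_tied_projection, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_tied_projection(X, S, Z, lam=0.2):
    """Fit a tied-weight linear autoencoder on concatenated inputs.

    X: (d, N) visual features, S: (k, N) semantic vectors,
    Z: (m, N) latent targets (an assumption of this sketch).
    Returns W of shape (m, d + k); encode with W @ Y, decode with W.T @ Z.
    """
    Y = np.vstack([X, S])                 # visual-semantic input, (d + k, N)
    A = Z @ Z.T                           # (m, m)
    B = lam * (Y @ Y.T)                   # (d + k, d + k)
    C = (1.0 + lam) * (Z @ Y.T)           # (m, d + k)
    m, D = C.shape
    # Setting the gradient to zero gives the Sylvester equation A W + W B = C.
    # Solve it through column-major vectorization with Kronecker products,
    # which is only practical for small dimensions.
    K = np.kron(np.eye(D), A) + np.kron(B.T, np.eye(m))
    w = np.linalg.solve(K, C.flatten(order="F"))
    return w.reshape((m, D), order="F")

# Synthetic check: data generated by a known projection with orthonormal rows.
rng = np.random.default_rng(0)
d, k, m, N = 6, 2, 3, 50
Q, _ = np.linalg.qr(rng.standard_normal((d + k, m)))  # orthonormal columns
W_true = Q.T                                          # (m, d + k)
Z = rng.standard_normal((m, N))                       # hypothetical latent codes
Y = W_true.T @ Z                                      # decoded visual-semantic data
X, S = Y[:d], Y[d:]                                   # split into the two parts

W = fit_tied_projection(X, S, Z)
Y_rec = W.T @ (W @ Y)                                 # encode, then decode
```

Because the same matrix serves as encoder and decoder, the reconstruction error of Y directly measures how much visual-semantic information the latent space retains. For realistic feature dimensions, a dedicated Sylvester solver would be a better fit than the Kronecker system used here.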
