An Integral Projection-based Semantic Autoencoder for Zero-Shot Learning

Zero-shot learning (ZSL) classification predicts classes (labels) that are absent from the training set (unseen classes). Recent works have proposed various semantic autoencoder (SAE) models in which the encoder embeds the visual feature space into the semantic space and the decoder reconstructs the original visual feature space. The objective is to learn an embedding from a source data distribution that transfers effectively to a different but related target data distribution. Such embedding-based methods are prone to the domain shift problem and are vulnerable to biases. We propose an integral projection-based semantic autoencoder (IP-SAE), in which the encoder projects the visual feature space, concatenated with the semantic space, into a latent representation space. We force the decoder to reconstruct the visual-semantic data space. Under this constraint, the visual-semantic projection function preserves the discriminative information contained in the original visual feature space. The enriched projection forces a more precise reconstruction of the visual feature space that is invariant to the domain manifold. Consequently, the learned projection function is less domain-specific and alleviates the domain shift problem. The proposed IP-SAE model consolidates a symmetric transformation function for embedding and projection, and thus provides transparency for interpreting generative applications in ZSL. In addition to outperforming state-of-the-art methods on four benchmark datasets, our analytical approach allows us to investigate distinct characteristics of generative methods in the unique context of zero-shot inference.
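
The encoder/decoder idea described in the abstract can be illustrated with a minimal linear sketch. It is not the authors' implementation: it assumes a tied-weight linear autoencoder (the encoder is a matrix W, the decoder is its transpose), assumes the latent target is the semantic matrix, and uses the closed-form Sylvester-equation solution commonly used for linear SAE models. The function name `fit_tied_linear_autoencoder`, the weight `lam`, and the synthetic data are all hypothetical. The only difference between the two fits is whether the input is the visual features alone or the visual features concatenated with the semantic vectors.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def fit_tied_linear_autoencoder(inputs, latent, lam):
    """Return W minimising ||inputs - W.T @ latent||^2 + lam * ||W @ inputs - latent||^2.

    inputs: (d, n) matrix, one column per sample.
    latent: (k, n) matrix of target latent codes.
    W has shape (k, d): W encodes, W.T decodes (tied weights, an assumption).
    """
    a = latent @ latent.T                    # (k, k)
    b = lam * (inputs @ inputs.T)            # (d, d)
    c = (1.0 + lam) * (latent @ inputs.T)    # (k, d)
    # Setting the gradient of the objective to zero gives a @ W + W @ b = c.
    return solve_sylvester(a, b, c)


# Synthetic stand-in data (hypothetical sizes, not a benchmark dataset).
rng = np.random.default_rng(0)
n, d, k = 200, 16, 5
semantic = rng.normal(size=(k, n))
visual = rng.normal(size=(d, k)) @ semantic + 0.1 * rng.normal(size=(d, n))
lam = 0.5

# Baseline SAE-style fit: visual features -> semantic space -> visual features.
w_visual = fit_tied_linear_autoencoder(visual, semantic, lam)

# Concatenated variant (assumed reading of the abstract): the encoder sees
# [visual; semantic] and the decoder reconstructs that joint space.
joint = np.vstack([visual, semantic])        # (d + k, n)
w_joint = fit_tied_linear_autoencoder(joint, semantic, lam)
joint_recon = w_joint.T @ (w_joint @ joint)  # encode, then decode
visual_recon = joint_recon[:d]               # visual part of the reconstruction
```

Because the same matrix is used in both directions, the learned mapping can be inspected directly, which is the kind of transparency the abstract refers to. How the semantic part of the input is supplied for unseen classes at test time is not specified in the abstract and is left out of this sketch.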
