Generalized Zero-Shot Learning (GZSL) is a challenging task that requires accurate classification of both seen and unseen classes. Within this domain, audio-visual GZSL is an exciting yet difficult task because it takes both visual and acoustic features as multi-modal inputs. Existing work in this field mostly uses either embedding-based or generative methods. However, generative training is difficult and unstable, while embedding-based methods often suffer from the domain shift problem. We therefore find it promising to integrate both methods into a unified framework that leverages their advantages while mitigating their respective disadvantages. Our study introduces a general framework built on out-of-distribution (OOD) detection to harness the strengths of both approaches. We first employ generative adversarial networks to synthesize unseen features, which enables the training of an OOD detector alongside classifiers for seen and unseen classes. The detector determines whether a test feature belongs to the seen or the unseen classes, and the feature is then classified by the separate classifier for that group. We test our framework on three popular audio-visual datasets and observe a significant improvement compared with existing state-of-the-art works. Code is available at https://github.com/liuyuan-wen/AV-OOD-GZSL.
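The inference-time routing described in the abstract (an OOD detector decides between seen and unseen, then a separate classifier handles each group) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the authors' repository: `gated_predict`, `nearest_prototype`, `distance_ood_score`, the 2-D toy features, and the threshold are assumptions for illustration only. In particular, the distance-based score is a stand-in for the learned OOD detector, which in the paper is trained with GAN-synthesized unseen features.

```python
import math


def nearest_prototype(prototypes):
    """Toy classifier: return the label whose prototype is closest (Euclidean)."""
    def classify(feature):
        return min(prototypes, key=lambda label: math.dist(feature, prototypes[label]))
    return classify


def distance_ood_score(seen_prototypes):
    """Toy OOD score: distance to the nearest seen-class prototype.

    Assumption: a stand-in for a trained OOD detector, not the paper's model.
    """
    def score(feature):
        return min(math.dist(feature, p) for p in seen_prototypes.values())
    return score


def gated_predict(feature, ood_score, seen_classifier, unseen_classifier, threshold):
    """Route a test feature to the seen or unseen classifier via the OOD score."""
    if ood_score(feature) >= threshold:
        # Far from every seen class: treat as unseen.
        return unseen_classifier(feature)
    return seen_classifier(feature)


# Hypothetical 2-D "fused audio-visual" prototypes, for illustration only.
seen = {"dog": (0.0, 0.0), "cat": (1.0, 0.0)}
unseen = {"zebra": (5.0, 5.0), "whale": (6.0, 4.0)}

seen_clf = nearest_prototype(seen)
unseen_clf = nearest_prototype(unseen)
score = distance_ood_score(seen)

near_seen = gated_predict((0.9, 0.1), score, seen_clf, unseen_clf, threshold=2.0)
far_from_seen = gated_predict((5.2, 4.9), score, seen_clf, unseen_clf, threshold=2.0)
print(near_seen, far_from_seen)  # cat zebra
```

The point of the gate is that each classifier only has to discriminate within its own group, so the seen-class classifier is never asked to compete with unseen labels. The cost is that any detector error is passed straight to the wrong classifier.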