Diffusion in Zero-Shot Learning for Environmental Audio

Zero-shot learning enables models to generalize to unseen classes by leveraging semantic information, bridging the gap between training and testing sets whose classes do not overlap. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, and existing studies report poor performance. Generative methods, which have demonstrated success in computer vision, are notably absent from environmental audio zero-shot learning, where classification-based approaches dominate. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, a novel diffusion model conditioned on class auxiliary data is introduced. The diffusion model generates synthetic data for unseen classes, which is combined with seen-class data to train a classifier. Experiments are conducted on two environmental audio datasets, ESC-50 and FSC22. Results show that the diffusion model significantly outperforms all baseline methods, achieving more than 25% higher accuracy on the ESC-50 test partition. This work establishes the diffusion model as a promising generative approach for zero-shot learning and introduces the first benchmark of generative methods for environmental audio zero-shot learning, providing a foundation for future research in the field. Code for the novel ZeroDiffusion method is provided at https://github.com/ysims/ZeroDiffusion.
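
The generate-then-classify pipeline described above can be illustrated with a minimal sketch. It is not the ZeroDiffusion implementation: the function generate_synthetic is a hypothetical placeholder that adds Gaussian noise around an auxiliary vector where a trained conditional diffusion sampler would be used, the class names and two-dimensional auxiliary vectors are invented toy values, and a nearest-centroid classifier is assumed in place of whatever classifier the paper trains.

```python
import random


def generate_synthetic(aux_vec, n_samples, rng, noise=0.1):
    """Hypothetical stand-in for a trained conditional generator.

    A real system would sample from a diffusion model conditioned on
    aux_vec; this placeholder only adds Gaussian noise around it.
    """
    return [[a + rng.gauss(0.0, noise) for a in aux_vec]
            for _ in range(n_samples)]


def fit_centroids(features_by_class):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label, rows in features_by_class.items():
        dim = len(rows[0])
        centroids[label] = [sum(r[i] for r in rows) / len(rows)
                            for i in range(dim)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(centroids, key=lambda label: dist(centroids[label]))


rng = random.Random(0)

# Toy auxiliary vectors (invented for illustration only).
aux = {"rain": [1.0, 0.0], "dog": [0.0, 1.0], "siren": [1.0, 1.0]}
seen, unseen = ["rain", "dog"], ["siren"]

# Seen-class features would normally come from real audio embeddings;
# they are simulated here because no dataset is loaded.
train = {c: generate_synthetic(aux[c], 20, rng) for c in seen}

# Unseen classes have no real samples, so their features are synthetic only.
train.update({c: generate_synthetic(aux[c], 20, rng) for c in unseen})

# Train one classifier on the combined seen and synthetic unseen data.
classifier = fit_centroids(train)
label = predict(classifier, [0.95, 1.05])
```

In this toy setup the query point lies near the auxiliary vector of the unseen class, so the classifier can assign it to "siren" even though no real "siren" sample was used in training. That is the property the generative approach relies on.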
