Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and test sets whose classes do not overlap. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, and existing studies report poor performance. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduce a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen-class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets (ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019) and one music classification dataset (GTZAN). Results show that the diffusion model outperforms all baseline methods on average across the six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.