Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, making the discriminator's task more challenging and thereby improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. Together, the two branches capture feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce TMixup, a spectrogram augmentation technique that enhances sensitivity to subtle, localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in localizing anomalies in the time-frequency domain.