First-shot (FS) unsupervised anomalous sound detection (ASD) is a new task introduced in DCASE 2023 Challenge Task 2, in which the anomalous sounds of the target machine types are unseen during training. Existing methods often rely on the availability of both normal and abnormal sound data from the target machines. Because no anomalous sound data are available for the target machine types, adapting existing ASD methods to the first-shot task is challenging. In this paper, we propose a new framework for first-shot unsupervised ASD in which metadata-assisted audio generation is used to estimate unknown anomalies. The available machine information (i.e., metadata and sound data) is used to fine-tune a text-to-audio generation model, which then generates anomalous sounds that carry the acoustic characteristics unique to each machine type. We then use Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to perform first-shot unsupervised ASD. In our experiments, the proposed FS-TWFR-GMM method achieves performance competitive with the top systems in DCASE 2023 Challenge Task 2 while requiring only 1% of the model parameters for detection.
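To make the detection backbone more concrete, the following is a minimal sketch of the general idea behind a time-weighted frequency-domain representation scored by a Gaussian model. It is not the authors' implementation. Several details are assumptions made purely for illustration: the time weighting is taken to be a rank-weighted pooling over frames with a decay factor `r`, a single diagonal Gaussian stands in for the Gaussian mixture, and `toy_clip` is a hypothetical generator of synthetic spectrogram-like data rather than real machine audio.

```python
import math
import random


def time_weighted_pool(spec, r):
    """Pool a spectrogram (list of frames, each a list of bin values) over time.

    Assumed form of the time weighting: per frequency bin, values are sorted
    in descending order and weighted by r**rank. r=1 gives mean pooling and
    r=0 gives max pooling.
    """
    n_frames, n_bins = len(spec), len(spec[0])
    weights = [r ** i for i in range(n_frames)]
    norm = sum(weights)
    pooled = []
    for k in range(n_bins):
        column = sorted((frame[k] for frame in spec), reverse=True)
        pooled.append(sum(w * v for w, v in zip(weights, column)) / norm)
    return pooled


def fit_diag_gaussian(vectors, eps=1e-6):
    """Fit a diagonal Gaussian (a simplified stand-in for a GMM) to normal clips."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    var = [sum((v[k] - mean[k]) ** 2 for v in vectors) / n + eps for k in range(d)]
    return mean, var


def anomaly_score(vector, mean, var):
    """Negative log-likelihood under the fitted Gaussian; higher is more anomalous."""
    return 0.5 * sum(
        math.log(2 * math.pi * s) + (x - m) ** 2 / s
        for x, m, s in zip(vector, mean, var)
    )


def toy_clip(rng, n_frames=20, n_bins=8, burst=0.0):
    """Hypothetical synthetic clip: Gaussian noise plus an optional burst in bin 0."""
    spec = [[rng.gauss(0.0, 1.0) for _ in range(n_bins)] for _ in range(n_frames)]
    for frame in spec[:3]:
        frame[0] += burst
    return spec


rng = random.Random(0)
# Fit on "normal" clips only, as in the unsupervised setting.
train = [time_weighted_pool(toy_clip(rng), r=0.9) for _ in range(50)]
mean, var = fit_diag_gaussian(train)

normal_score = anomaly_score(time_weighted_pool(toy_clip(rng), r=0.9), mean, var)
odd_score = anomaly_score(time_weighted_pool(toy_clip(rng, burst=10.0), r=0.9), mean, var)
print(normal_score, odd_score)
```

In this sketch the decay factor `r` interpolates between mean and max pooling over time. In the framework described above, generated anomalous sounds could plausibly serve as a validation signal for choosing such a setting per machine type, though the sketch does not model that step.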