Diffusion-based speech generators are ubiquitous. These methods can generate very high-quality synthetic speech, and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets that do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 open-source diffusion-based generators and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated by recent open-source and commercial speech generators.