Many deep-learning-based synthetic speech generation tools are readily available. Synthetic speech has been used for financial fraud, impersonation, and the spread of misinformation. For this reason, forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit to a single dataset, and their performance degrades substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper, we propose the Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time-domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on the ASVspoof2019 dataset. Our experiments show that PS3DT performs well on the ASVspoof2019 dataset compared to other spectrogram-based approaches for synthetic speech detection. We also investigate the generalization performance of PS3DT on the In-the-Wild dataset. PS3DT generalizes better than several existing methods when detecting synthetic speech from an out-of-distribution dataset. We also evaluate the robustness of PS3DT in detecting telephone-quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone-quality synthetic speech better than several existing methods.
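To make the patch-based processing concrete, the following is a minimal sketch of how a mel-spectrogram could be split into non-overlapping patches and flattened into a token sequence for a transformer. It is an illustration under stated assumptions, not the PS3DT implementation: the function name `patchify`, the 80 mel bins, the 400 frames, and the 16 × 16 patch size are all hypothetical choices that do not come from the abstract, and a synthetic array stands in for a real mel-spectrogram.

```python
import numpy as np


def patchify(spec: np.ndarray, patch_h: int, patch_w: int) -> np.ndarray:
    """Split a (n_mels, n_frames) spectrogram into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_h * patch_w), ordered row by row.
    """
    n_mels, n_frames = spec.shape
    rows, cols = n_mels // patch_h, n_frames // patch_w
    # Assumption: trailing bins/frames that do not fill a whole patch are dropped.
    spec = spec[: rows * patch_h, : cols * patch_w]
    # (rows, patch_h, cols, patch_w) -> (rows, cols, patch_h, patch_w)
    patches = spec.reshape(rows, patch_h, cols, patch_w).transpose(0, 2, 1, 3)
    # Flatten each patch into one token vector.
    return patches.reshape(rows * cols, patch_h * patch_w)


# Stand-in for a mel-spectrogram (hypothetical size: 80 mel bins x 400 frames).
spec = np.arange(80 * 400, dtype=np.float32).reshape(80, 400)
tokens = patchify(spec, patch_h=16, patch_w=16)
print(tokens.shape)  # (125, 256): 5 x 25 patches, each with 16 * 16 values
```

In a transformer-based detector, each row of `tokens` would typically be linearly projected to an embedding and combined with a positional encoding before entering the encoder; those steps are omitted here because the abstract does not specify them.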