Tools that generate high-quality synthetic speech perceptually indistinguishable from recorded human speech are readily available. Several approaches have been proposed for detecting synthetic speech, but many of them use deep learning methods as black boxes and give no reasoning for their decisions, which limits their interpretability. In this paper, we propose the Disentangled Spectrogram Variational Auto Encoder (DSVAE), a variational autoencoder trained in two stages that processes speech spectrograms using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map that highlights the spectrogram regions that discriminate synthetic from bona fide human speech. We evaluated the representations obtained from DSVAE on the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) in detecting synthetic speech from 6 known speech synthesizers and from 10 of 11 unknown speech synthesizers. We also visualize the representations obtained from DSVAE for 17 different speech synthesizers and verify that they are interpretable and discriminate bona fide from synthetic speech for each synthesizer.