The rhythm of bona fide speech is often difficult to replicate, so the fundamental frequency (F0) of synthetic speech differs significantly from that of real speech. The F0 feature is therefore expected to carry discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband feature for FSD. To model the F0 subband effectively and thereby improve FSD performance, we further propose the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net serves as the backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information as channel groups are successively superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that the proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among all single systems.
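For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is not the official ASVspoof scoring script: the function name `compute_eer` is hypothetical, and the sketch assumes that higher scores indicate bona fide speech and that the EER is approximated at the threshold where the false acceptance and false rejection rates are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate (EER) by sweeping thresholds.

    Assumption: a higher score means "more likely bona fide".
    """
    # Candidate thresholds are all observed scores.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores for illustration only (not data from the paper).
toy_bonafide = [0.9, 0.8, 0.7, 0.3]
toy_spoof = [0.1, 0.2, 0.4, 0.6]
toy_eer = compute_eer(toy_bonafide, toy_spoof)
```

On the toy scores, one bona fide utterance and one spoofed utterance are misclassified at the best threshold, so both error rates equal 25%. Evaluation toolkits typically interpolate between thresholds rather than taking the nearest one, so values from this sketch can differ slightly from officially reported numbers.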