The rhythm of synthetic speech is usually too smooth, so its fundamental frequency (F0) differs significantly from that of real speech. The F0 feature is therefore expected to carry discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. To model the F0 subband effectively and improve FSD performance, we also propose the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net serves as the backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information as the channel groups are repeatedly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that the proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among all single systems.
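
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is not the authors' evaluation code or the official ASVspoof scoring tool. The function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores in the usage note are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate (EER) and the threshold where it occurs.

    Assumption (hypothetical convention): a higher score means the trial is
    more likely bona fide, and a trial is accepted when score >= threshold.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, plus one value above the
    # maximum so that the "reject everything" operating point is included.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # False acceptance rate: fraction of spoofed trials that are accepted.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: fraction of bona fide trials that are rejected.
    frr = np.array([(bonafide < t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates are closest;
    # averaging them handles the case where they never coincide exactly.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]
```

As a usage example, `compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])` sweeps nine thresholds over made-up scores and returns the error rate and threshold at the crossing point. An EER of 0.47% corresponds to a returned value of 0.0047 under this convention.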