Distinguishing synthetic speech from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples, including utterances with multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, which simultaneously performs authenticity detection, localization of multiple fake segments, and recognition of synthesis algorithms, without any complex post-processing. TEST effectively integrates LSTM and Transformer layers to extract more powerful temporal speech representations and applies dense prediction on multi-scale pyramid features to estimate the synthetic spans. Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model's robust capability for comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.
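To make the dense-prediction-on-pyramid idea concrete, below is a minimal NumPy sketch: a feature sequence is average-pooled into a multi-scale temporal pyramid, and at every timestep of every level a linear head predicts a fake-probability plus (left, right) span offsets that decode back to input-scale segments. All function names, the stride-2 pooling, and the linear heads are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def build_pyramid(feats, num_levels=3):
    """Downsample a (T, D) feature sequence with stride-2 average pooling
    to form a multi-scale temporal pyramid (illustrative sketch)."""
    pyramid = [feats]
    for _ in range(num_levels - 1):
        f = pyramid[-1]
        T = f.shape[0] - f.shape[0] % 2          # drop a trailing odd frame
        f = f[:T].reshape(T // 2, 2, -1).mean(axis=1)
        pyramid.append(f)
    return pyramid

def dense_predict(pyramid, w_cls, w_reg):
    """At each timestep of each pyramid level, predict a per-timestep fake
    score and non-negative (left, right) offsets, then map the resulting
    span back to the input timescale via the level's stride."""
    spans = []
    for lvl, f in enumerate(pyramid):
        stride = 2 ** lvl
        cls = 1.0 / (1.0 + np.exp(-(f @ w_cls)))  # sigmoid fake score, (T, 1)
        reg = np.maximum(f @ w_reg, 0.0)          # offsets in level units, (T, 2)
        for t in range(f.shape[0]):
            start = (t - reg[t, 0]) * stride
            end = (t + reg[t, 1]) * stride
            spans.append((float(cls[t, 0]), start, end))
    return spans
```

In a real detector the candidate spans would then be filtered by score (the abstract notes TEST needs no complex post-processing such as heavy NMS); here the weights are placeholders standing in for learned LSTM/Transformer heads.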