Distinguishing synthetic speech from real speech is increasingly crucial due to the risks of misinformation and identity impersonation. While various datasets for synthetic speech analysis have been developed, they often focus on specific areas, limiting their utility for comprehensive research. To fill this gap, we propose the Speech-Forensics dataset, which extensively covers authentic, synthetic, and partially forged speech samples, the last containing multiple segments synthesized by different high-quality algorithms. Moreover, we propose a TEmporal Speech LocalizaTion network, called TEST, which simultaneously performs authenticity detection, localization of multiple fake segments, and recognition of synthesis algorithms, without any complex post-processing. TEST effectively integrates LSTM and Transformer to extract more powerful temporal speech representations and applies dense prediction over multi-scale pyramid features to estimate the synthetic spans. Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model's robust capability for comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.
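To make the architectural description concrete, below is a minimal PyTorch sketch of a TEST-style backbone: an LSTM followed by a Transformer encoder for temporal representations, a strided-convolution feature pyramid, and dense per-location prediction heads for authenticity, span boundaries, and synthesis-algorithm identity. All layer sizes, the pyramid depth, and the head design (`cls_head`, `reg_head`, `alg_head`) are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of an LSTM + Transformer temporal encoder with dense prediction
# over a multi-scale pyramid, as described in the abstract. Hyperparameters
# and head layout are assumptions for illustration only.
import torch
import torch.nn as nn

class TESTSketch(nn.Module):
    def __init__(self, feat_dim=128, d_model=256, num_algorithms=10,
                 pyramid_levels=4):
        super().__init__()
        # BiLSTM captures local temporal dynamics of the input features.
        self.lstm = nn.LSTM(feat_dim, d_model // 2, batch_first=True,
                            bidirectional=True)
        # Transformer encoder models long-range dependencies on top of it.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Strided convolutions halve the temporal resolution at each
        # pyramid level, yielding multi-scale features.
        self.downsample = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
            for _ in range(pyramid_levels - 1))
        # Shared heads applied densely at every pyramid location:
        # fake/real logit, span offsets (distance to segment start/end),
        # and synthesis-algorithm logits.
        self.cls_head = nn.Conv1d(d_model, 1, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(d_model, 2, kernel_size=3, padding=1)
        self.alg_head = nn.Conv1d(d_model, num_algorithms,
                                  kernel_size=3, padding=1)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        h = self.transformer(h).transpose(1, 2)  # (batch, d_model, time)
        pyramid = [h]
        for down in self.downsample:
            pyramid.append(down(pyramid[-1]))
        # Dense prediction at every location of every pyramid level.
        return [(self.cls_head(level),            # authenticity per location
                 self.reg_head(level),            # span boundary offsets
                 self.alg_head(level))            # algorithm identity
                for level in pyramid]

batch = torch.randn(2, 400, 128)  # e.g., 400 frames of 128-dim features
preds = TESTSketch()(batch)
print([p[0].shape for p in preds])  # one classification map per pyramid level
```

In this reading, per-location boundary regression on pyramid features lets the model localize multiple fake segments of varying lengths in one pass, which is consistent with the abstract's claim that no complex post-processing is required.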