Recent advances in synthetic speech generation have made it possible to create forged audio that is almost indistinguishable from real speech. This poses a new challenge for the multimedia forensics community, as the misuse of synthetic media can have harmful consequences. Several methods have been proposed in the literature to mitigate these risks and detect synthetic speech, mainly by analyzing the speech itself. However, recent studies have revealed that the most crucial frequency bands for detection lie in the highest ranges (above 6000 Hz), which do not include any speech content. In this work, we explore this aspect extensively and investigate whether synthetic speech detection can be performed by focusing only on the background component of the signal while disregarding its verbal content. Our findings indicate that the speech component is not the predominant factor in synthetic speech detection. These insights provide valuable guidance for developing new synthetic speech detectors and improving their interpretability, and they raise some considerations about existing work in the audio forensics field.