Advances in AI speech synthesis have made it easier than ever to generate realistic audio in a target voice. A few seconds of reference audio from the target are enough, quite literally putting words in that person's mouth. This poses new forensic challenges for speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where synthetic speech must be detected. At the same time, AI speech synthesis can improve communication through features such as low-bandwidth communication and audio enhancement, so legitimate uses of synthetic audio keep growing. In these cases, the goal is to verify that the synthesized voice is actually driven by the user it represents, which requires a mechanism for deciding whether a given synthetic audio clip is driven by an authorized identity. We term this task audio avatar fingerprinting. As a step towards audio forensics in these emerging settings, we analyze and extend an off-the-shelf speaker verification model, developed outside a forensics context, for the tasks of fake speech detection and audio avatar fingerprinting, the first experiment of its kind. Furthermore, we observe that no existing dataset supports the task of verifying the authorized use of synthetic audio, a limitation we address by introducing a new speech forensics dataset for this task.