Since text-to-speech (TTS) systems typically do not produce waveforms directly, recent spoof detection studies use waveforms resynthesized by vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are designed specifically for speech synthesis, neural audio codecs were originally developed to compress audio for storage and transmission. However, their ability to discretize speech has also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec-resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this ambiguity. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.