CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly capable. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging class of deepfake speech generated by CoSG systems as CodecFake. Detecting CodecFake speech is an urgent challenge, yet most existing systems focus on fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset of its kind and covers the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Together, the dataset and taxonomy enable analysis at multiple levels to identify the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, at the level of the full CoRS training set, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration aimed at developing better anti-spoofing models against CodecFake.
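To make the taxonomy-guided selection of training data concrete, the following is a minimal sketch, not the procedure used in this work. It tags each codec with the three taxonomy axes named above (vector quantizer, auxiliary objectives, decoder type) and keeps those with a disentanglement auxiliary objective or a frequency-domain decoder. The codec names, the `CodecEntry` class, the `select_training_codecs` helper, and every label other than "disentanglement" and "frequency-domain" are hypothetical placeholders introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecEntry:
    """One codec tagged along the three taxonomy axes (hypothetical schema)."""
    name: str
    quantizer: str   # vector quantizer type (placeholder labels below)
    auxiliary: str   # auxiliary objective, e.g. "disentanglement" or "none"
    decoder: str     # decoder type, e.g. "frequency-domain"


# Hypothetical entries: these are NOT codecs from the actual dataset.
CODECS = [
    CodecEntry("codec_a", "quantizer_type_1", "disentanglement", "time-domain"),
    CodecEntry("codec_b", "quantizer_type_1", "none", "frequency-domain"),
    CodecEntry("codec_c", "quantizer_type_2", "none", "time-domain"),
    CodecEntry("codec_d", "quantizer_type_2", "disentanglement", "frequency-domain"),
]


def select_training_codecs(codecs):
    """Keep codecs whose taxonomy labels match either favourable category."""
    return [
        c.name
        for c in codecs
        if c.auxiliary == "disentanglement" or c.decoder == "frequency-domain"
    ]


selected = select_training_codecs(CODECS)
print(selected)
```

In this sketch the re-synthesized speech from the selected codecs would then form the training subset; how the subset is actually weighted or combined is left open here.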
