Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages

Codecfakes (CFs) are a type of speech deepfake generated by Audio Language Models (ALMs), in which Neural Audio Codecs (NACs) form the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from those of vocoder-based deepfakes, so detectors trained on vocoder data generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English and, to a limited extent, Chinese, leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing speech from publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, their scale makes them impractical for real-world application, particularly in low-resource and latency-constrained settings. To address this limitation, we propose GARUDA, a novel Small-ALM tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed Small-ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.
