Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. We further propose Nes2Net-LA, a local-attention-enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Building on the dataset, we also define the API tracing task, which enables fine-grained attribution of spoofed audio to its generation source. Experiments show that Nes2Net-LA achieves state-of-the-art performance and superior robustness, particularly under diverse and unseen spoofing conditions. The code\footnote{https://github.com/XuepingZhang/MultiAPI-Spoof} and dataset\footnote{https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/} have been released.