Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key weakness of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we convert a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. The feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, which allows the experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices that affect the performance of this MoE conversion and investigate the activation behavior of the experts. Evaluated on 14 spoofing datasets, the proposed approach reduces the macro-averaged equal error rate (EER) from 5.46% to 4.81%, an 11.9% relative improvement over the baseline.
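As a rough illustration of the conversion described above, the following sketch replaces one feed-forward block with a gated mixture of experts. It is not the authors' implementation: the dense softmax gate, the initialization of every expert from the pretrained block, and all names (`feed_forward`, `moe_feed_forward`, `gate_weights`) are assumptions made for illustration, and the abstract does not state the routing scheme or the initialization. Under these assumptions, the MoE layer reproduces the pretrained output exactly before any fine-tuning, which is one plausible way to preserve pretrained representations.

```python
import math
import random


def feed_forward(x, w_in, w_out):
    """Two-layer feed-forward block: w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_feed_forward(x, experts, gate_weights):
    """Dense mixture (an assumption): gate-weighted sum of all expert outputs."""
    # One gate score per expert, computed by a linear layer owned by this encoder layer.
    gate = softmax([sum(w * xi for w, xi in zip(row, x)) for row in gate_weights])
    outputs = [feed_forward(x, w_in, w_out) for w_in, w_out in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(outputs[0]))
    ]
    return mixed, gate


random.seed(0)
dim, hidden_dim, num_experts = 4, 8, 3  # toy sizes, chosen arbitrarily

# Stand-in for the pretrained feed-forward weights of one encoder layer.
w_in = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w_out = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(dim)]
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [random.gauss(0, 1) for _ in range(dim)]

# Assumption: every expert starts from the pretrained block. The weights are
# shared here for brevity; a trainable version would deep-copy them.
experts = [(w_in, w_out) for _ in range(num_experts)]

dense_out = feed_forward(x, w_in, w_out)
moe_out, gate = moe_feed_forward(x, experts, gate_weights)
```

Because the gate weights sum to one and all experts are identical at initialization, `moe_out` matches `dense_out` up to floating-point error. The experts can only diverge and specialize once fine-tuning updates them separately.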