While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
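
To make the two encodings concrete, the following is a minimal sketch of how one candidate solution could be scored on both objectives. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy score matrix, and the parameter counts are hypothetical. It also assumes that complexity is measured as the total parameter count of the detectors that contribute to the fused score, and that in the real-valued encoding a detector with zero weight counts as unused. The NSGA-II search itself is omitted; only the per-candidate objective evaluation is shown.

```python
import numpy as np

def fuse_binary(scores, mask):
    """Binary encoding: average the scores of the selected detectors."""
    mask = np.asarray(mask, dtype=bool)
    return scores[:, mask].mean(axis=1)

def fuse_weighted(scores, weights):
    """Real-valued encoding: weighted sum of detector scores."""
    return scores @ np.asarray(weights, dtype=float)

def equal_error_rate(fused, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumes labels are 1 for bona fide and 0 for spoof, and that a
    higher score means "more likely bona fide".
    """
    thresholds = np.unique(fused)
    far = np.array([(fused[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(fused[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

def objectives(scores, labels, params, genome):
    """Return (error, complexity) for one candidate; both are minimized."""
    genome = np.asarray(genome)
    if genome.dtype == bool:
        active = genome
    else:
        active = genome > 0  # assumption: zero weight means "not deployed"
    if not active.any():
        # Assumption: an empty ensemble gets the worst possible error.
        return 1.0, 0.0
    if genome.dtype == bool:
        fused = fuse_binary(scores, genome)
    else:
        fused = fuse_weighted(scores, genome)
    return equal_error_rate(fused, labels), float(params[active].sum())

# Toy data (hypothetical): 4 utterances scored by 3 detectors.
scores = np.array([
    [0.9, 0.1, 0.5],
    [0.8, 0.9, 0.5],
    [0.1, 0.8, 0.5],
    [0.2, 0.2, 0.5],
])
labels = np.array([1, 1, 0, 0])
params = np.array([100.0, 300.0, 95.0])  # made-up sizes, in millions

binary_result = objectives(scores, labels, params, np.array([True, False, False]))
weighted_result = objectives(scores, labels, params, np.array([0.7, 0.0, 0.3]))
```

A multi-objective optimizer such as NSGA-II would call `objectives` for every genome in its population and keep the non-dominated candidates, which is what produces a Pareto front of accuracy versus size rather than a single fused system.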