Region-Based Optimization in Continual Learning for Audio Deepfake Detection

Rapid advances in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure how important each neuron is for real and for fake audio detection, and divide the neurons into four regions accordingly. First, we directly fine-tune the less important regions to adapt quickly to new tasks. Next, we optimize gradients along parallel directions for regions important only to real audio detection, and along orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use adaptive gradient optimization based on sample proportions. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to counter the accumulation of redundant neurons from old tasks, we introduce an Ebbinghaus forgetting mechanism that releases them, improving the model's ability to learn more generalized discriminative features. Experimental results show that our method achieves a 21.3% improvement in equal error rate (EER) over RWM, the state-of-the-art continual learning approach for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond audio deepfake detection, showing potential in other tasks such as image recognition. The code is available at https://github.com/cyjie429/RegO
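
The region-wise scheme described above can be illustrated with a small, dependency-free sketch. It is not the released RegO implementation. Several details are assumptions made only for illustration: the empirical diagonal Fisher estimate (mean squared per-sample gradient), the single importance `threshold`, the reference direction `ref` against which "parallel" and "orthogonal" are measured, and the linear blend weighted by `real_fraction` for the region important to both classes. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def partition_regions(fisher_real: Sequence[float],
                      fisher_fake: Sequence[float],
                      threshold: float) -> List[str]:
    """Assign each parameter to one of four regions (threshold is an assumption)."""
    labels = []
    for f_real, f_fake in zip(fisher_real, fisher_fake):
        real_important = f_real > threshold
        fake_important = f_fake > threshold
        if real_important and fake_important:
            labels.append("both")
        elif real_important:
            labels.append("real_only")
        elif fake_important:
            labels.append("fake_only")
        else:
            labels.append("neither")
    return labels


def split_parallel_orthogonal(grad: Sequence[float],
                              ref: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Decompose grad into components parallel and orthogonal to ref."""
    ref_sq = sum(r * r for r in ref)
    if ref_sq == 0.0:
        # No reference direction: the whole gradient counts as orthogonal.
        return [0.0] * len(grad), list(grad)
    scale = sum(g * r for g, r in zip(grad, ref)) / ref_sq
    parallel = [scale * r for r in ref]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal


def region_update(grad: Sequence[float], ref: Sequence[float],
                  labels: Sequence[str], real_fraction: float) -> List[float]:
    """Build an update vector by treating each region differently."""
    update = [0.0] * len(grad)
    for region in ("neither", "real_only", "fake_only", "both"):
        idx = [i for i, lab in enumerate(labels) if lab == region]
        if not idx:
            continue
        g = [grad[i] for i in idx]
        r = [ref[i] for i in idx]
        parallel, orthogonal = split_parallel_orthogonal(g, r)
        if region == "neither":
            chosen = g                      # plain fine-tuning
        elif region == "real_only":
            chosen = parallel               # keep only the parallel component
        elif region == "fake_only":
            chosen = orthogonal             # keep only the orthogonal component
        else:
            # Assumed blend: weight the two components by the share of real samples.
            chosen = [real_fraction * p + (1.0 - real_fraction) * o
                      for p, o in zip(parallel, orthogonal)]
        for i, value in zip(idx, chosen):
            update[i] = value
    return update


# Toy usage: eight parameters, two per region.
labels = partition_regions([0, 0, 1, 1, 0, 0, 1, 1],
                           [0, 0, 0, 0, 1, 1, 1, 1], threshold=0.5)
update = region_update(grad=[1, 2, 3, 4, 3, 4, 3, 4],
                       ref=[5, 5, 1, 0, 1, 0, 1, 0],
                       labels=labels, real_fraction=0.25)
```

In this toy run, the two unimportant parameters receive the raw gradient, while the other regions receive only the parallel component, only the orthogonal component, or a weighted mix of the two. A real model would compute the Fisher scores and gradients per layer with an autodiff framework rather than on flat Python lists.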
