Generalization is a significant challenge in audio deepfake detection: models trained on specific datasets often struggle to detect deepfakes generated under different conditions or by unknown algorithms. Although jointly training a model on diverse datasets can improve its generalization, doing so incurs high computational costs. To address this, we propose a neural collapse-based sampling approach that is applied to pre-trained models trained on distinct datasets in order to create a new training database. As a proof of concept, we use the ASVspoof 2019 dataset and implement pre-trained models with ResNet and ConvNeXt architectures. Our approach achieves comparable generalization on unseen data while being more computationally efficient, since it requires less training data. Evaluation is conducted on the In-the-wild dataset.
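The abstract does not specify how samples are selected. The following is a minimal sketch that rests on two assumptions. First, under neural collapse, each class's penultimate-layer embeddings cluster tightly around that class's mean. Second, "sampling" means keeping, for each pre-trained model and each class, the `k` samples nearest to that class mean. Neither of these is necessarily the authors' method. The functions `nc_sample` and `build_training_subset`, the Euclidean distance, and the per-class budget `k` are all hypothetical.

```python
import numpy as np


def nc_sample(embeddings, labels, k):
    """Hypothetical selection rule: for each class, keep the indices of the k
    samples whose embeddings are closest (Euclidean) to that class's mean."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)           # samples belonging to class c
        class_mean = embeddings[idx].mean(axis=0)   # assumed collapse point of class c
        dist = np.linalg.norm(embeddings[idx] - class_mean, axis=1)
        nearest = np.argsort(dist, kind="stable")[:k]  # ties keep original order
        selected.extend(idx[nearest].tolist())
    return sorted(selected)


def build_training_subset(per_model, k):
    """Hypothetical pooling step: per_model is a list of
    (dataset_name, embeddings, labels) tuples, one per pre-trained model and the
    dataset it was trained on. Returns (dataset_name, sample_index) pairs that
    together form the new, smaller training database."""
    subset = []
    for name, emb, lab in per_model:
        for i in nc_sample(emb, lab, k):
            subset.append((name, i))
    return subset


# Usage on toy 2-D embeddings (class 0 = bona fide, class 1 = spoof; assumed labels).
emb = [[0, 0], [1, 0], [10, 0], [0, 5], [0, 6], [0, 7]]
lab = [0, 0, 0, 1, 1, 1]
print(nc_sample(emb, lab, 1))  # [1, 4]
```

Selecting samples near the class means keeps the most prototypical examples from each source dataset. Under this sketch's assumptions, pooling these subsets across models could yield a compact, diverse training set without retraining on every full dataset.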