Existing fake audio detection systems perform well in in-domain testing but still struggle in out-of-domain testing. Two factors contribute to this: the mismatch between training and test data, and the poor generalizability of features extracted from a limited set of views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalizable information from three views: prosody, pronunciation, and wav2vec. Specifically, for the prosodic view, phoneme duration features are extracted with a model pre-trained on a large amount of speech data. For the pronunciation view, a Conformer-based phoneme recognition model is first trained, and its acoustic encoder is then retained as a deep embedding feature extractor. The prosodic and pronunciation features are then fused with wav2vec features through an attention mechanism to improve the generalization of fake audio detection models. Results show that the proposed approach achieves significant performance gains in several cross-dataset experiments.
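The attention-based fusion of the three views can be illustrated with a minimal sketch. The abstract does not specify the attention formulation, so the scaled dot-product scoring, the `attention_fuse` function, the `query` vector (standing in for a learned parameter), and the toy embeddings below are all assumptions made for illustration. The sketch also assumes each view has already been projected to a common dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(views, query):
    """Fuse same-sized view embeddings with scaled dot-product attention.

    views: dict mapping a view name to its embedding (list of floats).
    query: hypothetical query vector of the same length; in a trained
           model this would be a learned parameter (assumption).
    Returns (fused_embedding, weights_by_view).
    """
    dim = len(query)
    names = list(views)
    for name in names:
        if len(views[name]) != dim:
            raise ValueError(f"view {name!r} does not have dimension {dim}")

    # One attention score per view: scaled dot product with the query.
    scale = math.sqrt(dim)
    scores = [
        sum(q * v for q, v in zip(query, views[name])) / scale
        for name in names
    ]
    weights = softmax(scores)

    # The fused embedding is the attention-weighted sum of the views.
    fused = [
        sum(w * views[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy, made-up embeddings standing in for the three views.
views = {
    "prosody": [0.2, 0.1, 0.0, 0.4],
    "pronunciation": [0.5, 0.3, 0.1, 0.0],
    "wav2vec": [0.9, 0.7, 0.2, 0.1],
}
query = [1.0, 1.0, 0.0, 0.0]
fused, weights = attention_fuse(views, query)
```

In this sketch the weights sum to one, so the fused vector is a convex combination of the view embeddings. A practical system would more likely apply attention per frame over sequences of features, with learned projections for each view.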