Artefacts that distinguish bonafide from fake audio can exist in both short- and long-range segments; combining local and global feature information can therefore discriminate effectively between the two. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling limitation and capture long-range feature information. Moreover, we develop a bidirectional fusion module that integrates the embeddings from both directions, enhancing the audio context representation and combining short- and long-range information. The results show that RawBMamba achieves a 34.1\% improvement over Rawformer on the ASVspoof2021 LA dataset and performs competitively on other datasets.
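The bidirectional idea above can be illustrated with a minimal numpy sketch: a linear state space recurrence is scanned over the feature sequence once forward and once on the time-reversed sequence, and the two outputs are fused by concatenation. This is a hypothetical toy stand-in, not the paper's implementation: RawBMamba uses selective (input-dependent) Mamba blocks and a learned fusion module, whereas `ssm_scan`, `A`, `B`, `C`, and concatenation fusion here are illustrative assumptions.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Linear SSM recurrence: x_t = A x_{t-1} + B u_t, y_t = C x_t."""
    T, _ = u.shape
    x = np.zeros(A.shape[0])
    ys = []
    for t in range(T):
        x = A @ x + B @ u[t]
        ys.append(C @ x)
    return np.stack(ys)  # shape (T, d_out)

def bidirectional_ssm(u, A, B, C):
    """Run the scan in both time directions and fuse by concatenation."""
    fwd = ssm_scan(u, A, B, C)
    bwd = ssm_scan(u[::-1], A, B, C)[::-1]  # reverse-time pass, re-aligned
    return np.concatenate([fwd, bwd], axis=-1)  # (T, 2 * d_out)
```

Each output frame thus sees both past context (forward scan) and future context (backward scan), which is the gap a purely unidirectional Mamba leaves.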