Vision transformers (ViTs) have made substantial progress on classification tasks in computer vision. Recently, Gong et al. (2021) introduced attention-based modeling for several audio tasks. However, the use of ViTs for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built by fine-tuning the SSAST (Gong et al., 2022) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained with our framework achieves competitive performance on the ASVspoof 2021 challenge. We provide comparisons and ablation studies to support our claims.
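The equal error rate (EER) reported above is the operating point at which the false acceptance rate (spoof audio accepted as bonafide) equals the false rejection rate (bonafide audio rejected). The following is a minimal sketch of that metric, not the official ASVspoof scoring tool. It assumes that a higher score means "more likely bonafide", and the function name `compute_eer` is hypothetical.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bonafide."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score observed, in ascending order.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False acceptance rate: fraction of spoof trials with score >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: fraction of bonafide trials with score < threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])

    # The EER lies where the two curves cross. On finite data they may not
    # meet exactly, so take the closest point and average the two rates.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = float((far[idx] + frr[idx]) / 2.0)
    return eer, float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the paper.
    eer, thr = compute_eer([0.6, 0.4, 0.8, 0.9], [0.5, 0.1, 0.2, 0.3])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

A lower EER indicates better separation between the bonafide and spoof score distributions. A perfectly separating classifier reaches an EER of 0.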