Synthetic and manipulated speech can undermine automatic speaker verification (ASV) systems, so anti-spoofing countermeasures need to be accurate as well as efficient in both training and inference. This paper addresses the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training can under-weight hard trials and is not directly aligned with the ranking- and threshold-based metrics used for evaluation. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, models frame-level information with a Transformer encoder, applies attention pooling to obtain an utterance-level representation, and is trained with a combination of a focal classification loss and a pairwise ranking loss. RawBoost augmentation is applied during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with AASIST and RawNet2 baselines re-implemented under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, the focal loss, and attention pooling each improve performance. TFPARN also has the lowest inference memory among the compared systems (1.4 GB), runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results indicate that TFPARN offers a favorable trade-off between detection accuracy and computational cost for logical access anti-spoofing.
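To make the training objective concrete, the following is a minimal sketch of how a focal classification loss and a pairwise ranking loss might be combined over a mini-batch of utterance scores. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the logistic (RankNet-style) form of the pairwise term, the focusing parameter `gamma`, the weight `lam`, and all function names are hypothetical and not taken from the paper.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    # Stable evaluation of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def focal_loss(scores, labels, gamma=2.0):
    """Mean binary focal loss over logits.

    labels: 1 = bona fide, 0 = spoof (assumed convention).
    gamma = 0 reduces this to ordinary binary cross-entropy.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        nll = softplus(-s) if y == 1 else softplus(s)  # equals -log(p_t)
        total += (1.0 - p_t) ** gamma * nll     # down-weight easy trials
    return total / len(scores)


def pairwise_ranking_loss(scores, labels):
    """Mean logistic loss over all (bona fide, spoof) pairs in the batch.

    Penalises pairs where the spoof score is not clearly below the
    bona fide score. Returns 0.0 if the batch contains only one class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(softplus(-(p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def combined_loss(scores, labels, gamma=2.0, lam=1.0):
    # Hypothetical weighting: focal term plus lam times the ranking term.
    return focal_loss(scores, labels, gamma) + lam * pairwise_ranking_loss(scores, labels)


if __name__ == "__main__":
    batch_scores = [2.1, 0.3, -1.7, -0.2]   # made-up logits
    batch_labels = [1, 1, 0, 0]
    print(combined_loss(batch_scores, batch_labels))
```

The focal term concentrates the gradient on hard trials, while the pairwise term depends only on the relative ordering of bona fide and spoof scores, which is closer in spirit to ranking-based metrics such as EER. How the two terms are actually weighted in TFPARN is not stated in the abstract.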