Automatic Speaker Verification (ASV) systems, which verify a speaker's identity from voice characteristics, have numerous applications, such as user authentication in financial transactions, access control in smart devices, and forensic fraud detection. However, advances in deep learning have made it possible to generate convincing synthetic audio with Text-to-Speech (TTS) and Voice Conversion (VC) systems, leaving ASV systems vulnerable to synthetic-speech attacks. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 improves performance more than twofold. It achieves a minimum detection cost function (minDCF) of 0.5357 in the closed condition and 0.1414 in the open condition, significantly improving the detection of synthetic voices and strengthening ASV security. \textbf{The new version of the model is publicly available at \href{https://huggingface.co/lab260/Spectra-AASIST3}{\underline{HuggingFace (2026)}}.}