The automatic speaker verification (ASV) system, which identifies a speaker by voice, is one of the most important components of biometric security. ASV systems can be used on their own or in combination with other AI models. The quality and number of neural networks are growing rapidly, and with them the number of systems that manipulate audio through voice conversion and text-to-speech (TTS) models. Research on voice biometric forgery and its detection is driven by challenges such as SSTC, ASVspoof, and SingFake. This paper presents an automatic speaker verification system. The primary objective of our model is to extract embeddings from the target speaker's audio that capture important characteristics of the voice, such as pitch, energy, and phoneme durations. This information feeds our multi-voice TTS pipeline, which is currently under development. The same model was also employed in the SSTC challenge to verify speakers whose voices had undergone voice conversion, where it achieved an EER of 20.669%.
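The frame-level voice characteristics mentioned above (pitch and energy) can be illustrated with a minimal NumPy-only sketch. This is not the authors' actual embedding model; the framing parameters and the autocorrelation-based pitch estimator are illustrative assumptions, shown here on a synthetic 220 Hz tone standing in for voiced speech:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    """Per-frame RMS energy."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Crude pitch estimate: lag of the autocorrelation peak in [fmin, fmax]."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220.0 * t)  # synthetic stand-in for a voiced segment

frames = frame_signal(x, frame_len=1024, hop=256)
energies = rms_energy(frames)                           # ~0.707 for a unit sine
pitches = np.array([autocorr_pitch(f, sr) for f in frames])  # ~220 Hz per frame
```

In a real system these hand-crafted features would be replaced or supplemented by learned speaker embeddings; the sketch only shows what "pitch" and "energy" mean at the frame level.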