A speech spoofing countermeasure (CM) that discriminates between unseen spoofed and bona fide data requires diverse training data. While many datasets use spoofed data generated by speech synthesis systems, it was recently found that data vocoded by neural vocoders are also effective as spoofed training data. Since many neural vocoders are fast to build and fast at generation, this study used multiple neural vocoders to create more than 9,000 hours of vocoded data based on the VoxCeleb2 corpus. This study investigates how this large-scale vocoded data can improve spoofing countermeasures that use data-hungry self-supervised learning (SSL) models. Experiments demonstrated that overall CM performance on multiple test sets improved when using features extracted by an SSL model continually trained on the vocoded data. Further improvement was observed when using a new SSL model distilled from the two SSL models before and after the continual training. The CM with the distilled SSL model outperformed the previous best model on challenging unseen test sets, including ASVspoof 2019 logical access, WaveFake, and In-the-Wild.
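The abstract does not specify how the new SSL model is distilled from the two SSL models, so the following is only a minimal sketch of one plausible two-teacher feature distillation objective, not the authors' method. It assumes the student has one prediction head per teacher and that each head is trained to match frame-level teacher features with an L1 term plus a cosine-distance term. All names here (`frame_loss`, `two_teacher_distillation_loss`, `weight_after`) are hypothetical and chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def frame_loss(student, teacher):
    """Assumed per-frame loss: mean absolute error plus cosine distance."""
    l1 = sum(abs(a - b) for a, b in zip(student, teacher)) / len(teacher)
    return l1 + (1.0 - cosine_similarity(student, teacher))


def mean_frame_loss(student_frames, teacher_frames):
    """Average the per-frame loss over a sequence of frames."""
    losses = [frame_loss(s, t) for s, t in zip(student_frames, teacher_frames)]
    return sum(losses) / len(losses)


def two_teacher_distillation_loss(head_before, head_after,
                                  teacher_before, teacher_after,
                                  weight_after=0.5):
    """Weighted sum of the losses against the two teachers.

    head_before / head_after: student head outputs (lists of frames).
    teacher_before: features from the SSL model before continual training.
    teacher_after: features from the SSL model after continual training.
    weight_after: hypothetical mixing weight in [0, 1] for the second teacher.
    """
    loss_before = mean_frame_loss(head_before, teacher_before)
    loss_after = mean_frame_loss(head_after, teacher_after)
    return (1.0 - weight_after) * loss_before + weight_after * loss_after


# Toy usage: two frames of 3-dimensional features per model.
teacher_before = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]
teacher_after = [[0.0, 1.0, 1.0], [2.0, -0.5, 0.5]]
student_head_before = [[0.9, 0.1, 1.8], [0.4, 1.6, -0.9]]
student_head_after = [[0.1, 0.8, 1.1], [1.9, -0.4, 0.6]]

loss = two_teacher_distillation_loss(
    student_head_before, student_head_after, teacher_before, teacher_after
)
print(f"distillation loss: {loss:.4f}")
```

In a real training setup the features would be tensors from the two SSL models and the loss would be minimized with a gradient-based optimizer; the pure-Python version above only makes the structure of the objective explicit.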