Modern automatic speaker verification (ASV) systems trained exclusively on adult data lose substantial accuracy when applied to children's speech. The scarcity of children's speech corpora makes it difficult to fine-tune ASV systems for children, so there is a timely need for more effective ways to reuse adult speech data. One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred to here as ChildAugment. Specifically, we modify the formant frequencies and formant bandwidths of adult speech to emulate children's speech. The modified spectra are used to train an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time-delay neural network) recognizer for children. We compare ChildAugment against various state-of-the-art data augmentation techniques for children's ASV. We also extensively compare different scoring methods, including cosine scoring, PLDA (probabilistic linear discriminant analysis), and NPLDA (neural PLDA), and propose a low-complexity weighted cosine score for extremely low-resource children's ASV. Our findings on the CSLU kids corpus indicate that ChildAugment holds promise as a simple, acoustics-motivated approach for improving state-of-the-art deep-learning-based ASV for children. We achieve up to 12.45% (boys) and 11.96% (girls) relative improvement over the baseline.
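The abstract does not spell out how formant frequencies and bandwidths are modified. As a rough illustration only, one common way to manipulate formants is through the poles of an all-pole (LPC) vocal-tract model: the pole angle sets the resonance frequency and the pole radius sets its bandwidth. The sketch below assumes that model; the function names `pole_formants` and `warp_formants`, the scale factors, and the single-resonance toy filter are all hypothetical and are not taken from the paper.

```python
import numpy as np

def pole_formants(lpc, fs):
    """Return (frequencies_hz, bandwidths_hz) of the upper-half-plane poles of 1/A(z)."""
    poles = np.roots(lpc)
    poles = poles[np.imag(poles) > 0]            # keep one pole per conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)   # pole angle -> resonance frequency
    bws = -np.log(np.abs(poles)) * fs / np.pi    # pole radius -> 3 dB bandwidth
    order = np.argsort(freqs)
    return freqs[order], bws[order]

def warp_formants(lpc, freq_scale, bw_scale):
    """Scale the formant frequencies and bandwidths of a monic LPC polynomial A(z).

    Assumption: lpc[0] == 1 and all poles lie inside the unit circle.
    """
    poles = np.roots(lpc)
    # Scaling the angle scales the frequency; clip so poles stay below Nyquist.
    angles = np.clip(np.angle(poles) * freq_scale, -np.pi, np.pi)
    # Since B = -ln(r) * fs / pi, scaling B by k corresponds to r -> r**k.
    radii = np.abs(poles) ** bw_scale
    return np.real(np.poly(radii * np.exp(1j * angles)))

# Toy "adult" filter: one resonance at 1000 Hz with 100 Hz bandwidth.
fs = 16000
theta = 2 * np.pi * 1000 / fs
r = np.exp(-np.pi * 100 / fs)
adult = np.real(np.poly([r * np.exp(1j * theta), r * np.exp(-1j * theta)]))

# Hypothetical scale factors, chosen only for illustration.
child_like = warp_formants(adult, freq_scale=1.2, bw_scale=1.1)
print(pole_formants(child_like, fs))
```

With these illustrative factors, the toy resonance moves from 1000 Hz to about 1200 Hz and its bandwidth widens from 100 Hz to about 110 Hz. In a full augmentation pipeline, a warp like this would presumably be applied frame by frame to LPC envelopes estimated from adult speech before resynthesis or feature extraction, but those steps are outside what the abstract describes.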