Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a fine-tuned model and those from its pretrained counterpart, encode task-specific information that complements fine-tuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10% relative WER reduction for HuBERT and a 4.4% reduction for wav2vec 2.0 (W2V2), compared with fine-tuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64%, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.
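The core idea above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the paper's exact pipeline: the tensor shapes, model names, and the concatenation-based fusion are hypothetical stand-ins for frame-level SSL features; the real system would extract these from pretrained and fine-tuned checkpoints of the named models.

```python
import torch

# Hypothetical dimensions standing in for real SSL encoder outputs
# (batch, frames, hidden): 768 matches base-size W2V2/HuBERT, but is arbitrary here.
batch, frames, dim = 1, 50, 768

# Stand-ins for frame-level embeddings from the three encoders involved
pretrained_w2v2 = torch.randn(batch, frames, dim)  # frozen pretrained W2V2
finetuned_w2v2 = torch.randn(batch, frames, dim)   # W2V2 fine-tuned on child speech
finetuned_wavlm = torch.randn(batch, frames, dim)  # fine-tuned WavLM

# Delta embedding: the representation shift induced by fine-tuning
delta_w2v2 = finetuned_w2v2 - pretrained_w2v2

# One simple fusion strategy: concatenate along the feature dimension,
# pairing WavLM's fine-tuned features with the complementary delta features
fused = torch.cat([finetuned_wavlm, delta_w2v2], dim=-1)
print(fused.shape)  # torch.Size([1, 50, 1536])
```

The fused features would then feed a downstream ASR head; concatenation is only one of the fusion strategies the abstract alludes to.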