Speech sound disorders account for approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean toddler speech remain underdeveloped. This paper presents an end-to-end pipeline for automated pronunciation evaluation of Korean toddler speech that combines neural speaker diarization with self-supervised speech representation learning. We introduce a novel IRB-approved corpus of 53 recordings from Korean-speaking children aged 2 to 5 years. A subset of 53 subjects was annotated by three independent reviewers, yielding 1,190 consonant and 748 vowel word-level binary correctness labels. We evaluate three diarization models and find that NeMo SortFormer achieves 88.69% speaker count accuracy and a 33.04% diarization error rate (DER), owing to its arrival-time-sorted transformer architecture, which handles the acoustic confound between toddler speech and the speech of young female caregivers using aegyo (an affected, childlike speaking style). For pronunciation scoring, we compare three self-supervised learning (SSL) backbones across multiple pooling strategies. A cross-model ensemble that routes consonant prediction to HuBERT-large and vowel prediction to WavLM-large achieves balanced accuracies of 0.720 on consonants and 0.845 on vowels, for a mean of 0.782.
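The routing idea and the reported metric can be illustrated with a minimal sketch. It assumes binary labels (1 = correct, 0 = incorrect) and uses the standard definition of balanced accuracy as the mean of per-class recall. The names `balanced_accuracy`, `route_predictions`, and the two threshold scorers are hypothetical stand-ins for illustration only; they are not the paper's HuBERT-large or WavLM-large models, and the feature values are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    recalls: List[float] = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            continue  # skip a class that never occurs in the reference labels
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def route_predictions(
    items: Sequence[Tuple[str, float]],
    scorers: Dict[str, Callable[[float], int]],
) -> List[int]:
    """Send each item to the scorer registered for its phoneme class."""
    return [scorers[kind](feature) for kind, feature in items]


# Hypothetical stand-in scorers: simple thresholds on a made-up scalar feature.
scorers = {
    "consonant": lambda x: int(x >= 0.5),  # placeholder for a consonant model
    "vowel": lambda x: int(x >= 0.3),      # placeholder for a vowel model
}

items = [("consonant", 0.7), ("vowel", 0.4), ("consonant", 0.2), ("vowel", 0.1)]
labels = [1, 1, 1, 0]

preds = route_predictions(items, scorers)
score = balanced_accuracy(labels, preds)
```

Balanced accuracy is a natural choice when correct and incorrect productions are unevenly represented, because each class contributes equally to the score regardless of how many examples it has.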