To understand why self-supervised learning (SSL) models achieve strong empirical performance on several speech-processing downstream tasks, numerous studies have analyzed the information encoded in SSL layer representations of adult speech. Far less work has investigated how pre-training and fine-tuning affect the way SSL models encode children's speech and vocalizations. In this study, we aim to bridge this gap by probing SSL models on two relevant downstream tasks: (1) phoneme recognition (PR) on the speech of adults, older children (8-10 years old), and younger children (1-4 years old), and (2) vocalization classification (VC), which distinguishes cry, fuss, and babble in infants under 14 months old. For PR on younger children's speech, the advantage of fine-tuned SSL models stems largely from their ability to learn features that represent older children's speech and then adapt those features to the speech of younger children. For infant VC, SSL models pre-trained on large-scale home recordings learn to leverage phonetic representations at the middle layers, which improves performance on this task.