To understand why self-supervised learning (SSL) models achieve strong empirical performance on several downstream speech-processing tasks, numerous studies have analyzed the information encoded in SSL layer representations of adult speech. Little work, however, has investigated how pre-training and fine-tuning affect the way SSL models encode children's speech and vocalizations. In this study, we aim to bridge this gap by probing SSL models on two relevant downstream tasks: (1) phoneme recognition (PR) on the speech of adults, older children (8-10 years old), and younger children (1-4 years old), and (2) vocalization classification (VC), which distinguishes cry, fuss, and babble in infants under 14 months old. For younger children's PR, the superiority of fine-tuned SSL models is largely due to their ability to learn features that represent older children's speech and then adapt those features to the speech of younger children. For infant VC, SSL models pre-trained on large-scale home recordings learn to leverage phonetic representations at middle layers, thereby improving performance on this task.