The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio could substantially reduce the labor of coding children's behaviors, helping clinicians capture critical events and communicate better with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300 hours of home audio of children under 5 years old, to build a unified system for clinician–child speaker diarization and vocalization classification (VC). To enhance children's VC, we build a W2V2 phoneme recognition system for children under 4 years old, and we either incorporate its phonetically tuned embeddings as auxiliary features or use the recognition of pseudo phonetic transcripts as an auxiliary task. We test our method on two corpora (Rapid-ABC and BabbleCor) and obtain consistent improvements. Additionally, we surpass the state-of-the-art performance on the reproducible subset of BabbleCor. Code available at https://huggingface.co/lijialudew