Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms pre-training on adult speech. Second, we demonstrate that incorporating surrounding context markedly improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but that the model still consistently outperforms a rule-based baseline.