Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances

Evaluating the quality of children's utterances in adult-child dialogue remains challenging because context-sensitive metrics are lacking. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, so they miss aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions of child language development. Expansion captures elaboration, clause combining, and the use of causal and contrastive connectives; Independence captures initiative, topic control, audience design, and decreasing reliance on adult scaffolding through growing self-regulation. We establish developmental validity by showing age-related patterns, and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
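
The length-dominance problem can be seen in a toy example. The sketch below is only an illustration, not the framework described above: the helper `mlu_words` and both child responses are hypothetical, and MLU is counted in words for simplicity, although it is usually counted in morphemes. Two replies to the same adult question get an identical score even though only one of them supplies a cause.

```python
def mlu_words(utterances):
    """Mean length of utterance in words.

    Simplifying assumption: MLU is usually counted in morphemes;
    whitespace-separated words are used here to keep the sketch short.
    """
    if not utterances:
        raise ValueError("need at least one utterance")
    return sum(len(u.split()) for u in utterances) / len(utterances)


# Hypothetical exchange. Adult: "Why is the dog wet?"
minimal = ["the dog is wet now"]            # restates the prompt
elaborated = ["because he fell in water"]   # answers with a causal connective

mlu_minimal = mlu_words(minimal)
mlu_elaborated = mlu_words(elaborated)

# Both responses contain five words, so a length-based proxy cannot
# tell them apart, and it never looks at the adult's question at all.
print(mlu_minimal, mlu_elaborated)
```

A context-aware score would need both the preceding adult utterance and the child's reply as input, which is the gap the Expansion and Independence axes are meant to address.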
