Evaluating young children's language is challenging for automatic speech recognizers because of high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, absolute improvements of 10.47 and 7.06 percentage points, respectively, over a greedy-search decoder. An LLM then uses these high-quality transcripts to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators. Our findings show that precise phoneme recognition is essential for an effective assessment framework and enables scalable language screening for children.
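The phoneme error rate quoted in the abstract is conventionally the edit distance between the reference and hypothesis phoneme sequences divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, and the function name `phoneme_error_rate` and the ARPAbet-style example phonemes are assumptions made for illustration.

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences,
    normalized by the number of reference phonemes."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis phonemes.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference phoneme missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis phoneme)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: "rabbit" produced as "wabbit",
# i.e. one substitution over five reference phonemes.
reference = ["R", "AE", "B", "IH", "T"]
hypothesis = ["W", "AE", "B", "IH", "T"]
per = phoneme_error_rate(reference, hypothesis)
```

Under this definition, an absolute improvement is the difference between two such rates in percentage points, so the reported figures imply a greedy-search baseline of roughly 1.39 + 10.47 = 11.86% on MyST and 8.61 + 7.06 = 15.67% on Multitudes.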