Evaluating young children's language is challenging for automatic speech recognizers because of high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a phoneme error rate of 1.39% on MyST and 8.61% on Multitudes, absolute improvements of 10.47% and 7.06%, respectively, over a greedy-search decoder. An LLM then uses these high-quality transcripts to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with those of human evaluators. Our findings show that precise phoneme recognition is essential for an effective assessment framework and enables scalable language screening for children.
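For readers unfamiliar with the metric, the phoneme error rate (PER) reported above is conventionally computed as the Levenshtein edit distance between the reference and hypothesized phoneme sequences, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the example phoneme sequences are hypothetical. The final lines also work out the greedy-decoder PER implied by the abstract, assuming "absolute improvement" means a difference in percentage points.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference phoneme
                    curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: a child says "the cat" with "DH" realized as "D"
# and the final "T" dropped (ARPAbet-style symbols, chosen for illustration).
reference = ["DH", "AH", "K", "AE", "T"]
hypothesis = ["D", "AH", "K", "AE"]
per = phoneme_error_rate(reference, hypothesis)  # 2 errors / 5 phonemes

# Greedy-decoder PER implied by the abstract, assuming the improvements
# are percentage-point differences (an interpretation, not a stated result).
implied_greedy_myst = round(1.39 + 10.47, 2)
implied_greedy_multitudes = round(8.61 + 7.06, 2)
```

Under that reading, the greedy-search baseline would sit at roughly 11.86% PER on MyST and 15.67% on Multitudes.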