Automated stuttering detection (ASD) systems struggle with paediatric speech for two reasons: developing voices are acoustically highly variable, and the distinction between pathological stuttering and typical developmental disfluencies is subtle. We introduce Paediatric-HGNN, a framework built on a Context-aware Part-whole Interaction Network (CaPIN) and tailored to paediatric data. Instead of modelling speech as a conventional 1D signal, our approach builds a heterogeneous graph that captures the hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. By modelling hierarchical lexical-acoustic interactions, the framework captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.
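To make the word-node and frame-node structure concrete, the following is a minimal sketch of how such a heterogeneous graph might be assembled from a time-aligned transcript. It is not the Paediatric-HGNN or CaPIN implementation: the function name `build_hetero_graph`, the `(token, start_ms, end_ms)` input format, the 20 ms frame hop, and the edge relation names (`contains`, `precedes`) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_hetero_graph(aligned_words, num_frames, hop_ms=20):
    """Build a toy word/frame heterogeneous graph (illustrative only).

    aligned_words: list of (token, start_ms, end_ms), sorted by time.
                   This input format is an assumption, not the paper's.
    num_frames:    number of fixed-length acoustic frames in the utterance.
    hop_ms:        frame hop in milliseconds (assumed value).
    """
    edges = defaultdict(list)

    # Part-whole edges: a word "contains" every frame whose centre
    # falls inside the word's time span.
    for w_idx, (_, start_ms, end_ms) in enumerate(aligned_words):
        for f_idx in range(num_frames):
            centre_ms = f_idx * hop_ms + hop_ms // 2
            if start_ms <= centre_ms < end_ms:
                edges[("word", "contains", "frame")].append((w_idx, f_idx))

    # Temporal edges within each level of the hierarchy.
    for f_idx in range(num_frames - 1):
        edges[("frame", "precedes", "frame")].append((f_idx, f_idx + 1))
    for w_idx in range(len(aligned_words) - 1):
        edges[("word", "precedes", "word")].append((w_idx, w_idx + 1))

    return {
        "word_nodes": [token for token, _, _ in aligned_words],
        "num_frame_nodes": num_frames,
        "edges": dict(edges),
    }


# Hypothetical utterance: two words separated by a short pause.
words = [("I", 0, 60), ("w-w-want", 80, 200)]
graph = build_hetero_graph(words, num_frames=10)
print(graph["edges"][("word", "contains", "frame")])
```

In this toy utterance the frame centred at 70 ms lies in the pause between the two words, so it receives no part-whole edge and stays connected only through the frame-level temporal chain. How the actual framework treats such unaligned frames is not stated in the abstract.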