Due to the subjective nature of current clinical evaluation, the need for automatic severity evaluation of dysarthric speech has emerged. DNN models outperform conventional ML models but lack user-friendly explainability. ML models offer explainable results at the feature level, but their performance is comparatively lower. Current ML models extract various features from raw waveforms to predict severity; however, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. We introduce ASR transcription as a novel feature extraction source: we finetune an ASR model on dysarthric speech, then use it to transcribe dysarthric speech and extract word segment boundary information. This enables capturing both finer pronunciation features and broader prosodic features. These features improved severity prediction performance over existing features, achieving a balanced accuracy of 83.72%.
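As a minimal sketch of the transcription-with-boundaries step, the snippet below uses a stock Whisper checkpoint via the openai-whisper package; the Whisper backbone, the finetuned dysarthric model, and the derived features (speaking rate, mean inter-word pause) are illustrative assumptions, not the exact pipeline or feature set of this work.

```python
# Sketch: extract word segment boundaries from ASR output and derive
# simple prosodic features. A finetuned dysarthric ASR checkpoint would
# replace the stock model in practice (assumption, not the paper's model).
import whisper


def extract_word_boundaries(audio_path: str, model_name: str = "small"):
    """Transcribe speech and return (word, start_sec, end_sec) tuples."""
    model = whisper.load_model(model_name)  # stand-in for a finetuned model
    result = model.transcribe(audio_path, word_timestamps=True)
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words


def prosodic_features(words):
    """Derive coarse prosodic features from word boundaries:
    speaking rate (words/sec) and mean inter-word pause (sec).
    These two features are illustrative examples only."""
    if len(words) < 2:
        return {"rate": 0.0, "mean_pause": 0.0}
    total = words[-1][2] - words[0][1]  # span from first onset to last offset
    pauses = [b[1] - a[2] for a, b in zip(words, words[1:]) if b[1] > a[2]]
    return {
        "rate": len(words) / total if total > 0 else 0.0,
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }


if __name__ == "__main__":
    boundaries = extract_word_boundaries("dysarthric_sample.wav")
    print(prosodic_features(boundaries))
```

Such boundary-derived features would then be fed, alongside pronunciation-level features, to an explainable ML classifier for severity prediction.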