Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning

Dysarthria, a condition resulting from impaired control of the speech muscles due to neurological disorders, significantly impacts the communication and quality of life of patients. The complexity of the condition, its varied presentations, and the reliance on human scoring make its assessment and management challenging. This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data. It can offer an objective, repeatable, accessible, standardised and cost-effective alternative to traditional methods that require human expert assessors. We develop a transformer framework, called Speaker-Agnostic Latent Regularisation (SALR), which incorporates a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification. The multi-task framework is designed to reduce reliance on speaker-specific characteristics and to address the intrinsic intra-class variability of dysarthric speech. Evaluated on the Universal Access Speech dataset using leave-one-speaker-out cross-validation, our model outperformed traditional machine learning approaches, with an accuracy of $70.48\%$ and an F1 score of $59.23\%$. Our SALR model also exceeded the previous benchmark for AI-based classification, which used support vector machines, by $16.58\%$. We open the black box of our model by visualising its latent space, where we observe that the model substantially reduces speaker-specific cues and amplifies task-specific ones, indicating its robustness. In conclusion, SALR establishes a new benchmark in speaker-independent multi-class dysarthria severity classification using generative AI. Our findings have potential implications for broader clinical applications in automated dysarthria severity assessment.
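
The leave-one-speaker-out protocol mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function name `leave_one_speaker_out` and the toy speaker labels are hypothetical, and the only assumption is that each utterance carries a speaker identifier. In each fold, every utterance from one speaker is held out for testing and the model is trained on all remaining speakers, so no speaker appears on both sides of the split.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    # Group utterance indices by the speaker who produced them.
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)

    # Hold out each speaker in turn; train on everyone else.
    for held_out in sorted(by_speaker):
        test_idx = by_speaker[held_out]
        train_idx = sorted(
            i
            for speaker, indices in by_speaker.items()
            if speaker != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy data: one speaker label per utterance (not real dataset IDs).
speakers = ["S1", "S1", "S2", "S3", "S3", "S3"]
folds = list(leave_one_speaker_out(speakers))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Reported metrics under this protocol are typically aggregated over the folds, which is what makes the evaluation speaker-independent: the classifier is always scored on a voice it has never seen during training.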
