Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies compare only a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (a multiclass setting). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia, and 46 with autism, as well as matched controls) and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. Binary models performed comparably to previous research (F1 scores of 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia). When differentiating between multiple diagnostic groups, however, performance decreased markedly (F1 scores of 0.35-0.44 for ASD, 0.57-0.75 for MDD, and 0.15-0.66 for schizophrenia; macro F1 of 0.38-0.52). Combining voice- and text-based models increased performance, suggesting that the two modalities capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or on markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions. We provide recommendations for future research in the field, suggesting an increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.
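For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how per-class F1 and macro F1 are conventionally computed in a multiclass setting. It is not the study's evaluation code. The function names, the group abbreviations (`SCZ`, `HC`), and the toy labels are assumptions made up for illustration and do not come from the dataset.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its one-vs-rest F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels (hypothetical, NOT study data): two recordings per group.
y_true = ["ASD", "ASD", "MDD", "MDD", "SCZ", "SCZ", "HC", "HC"]
y_pred = ["ASD", "HC", "MDD", "MDD", "SCZ", "MDD", "HC", "HC"]

f1_by_class = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(f1_by_class)
print(round(overall, 3))
```

Because macro F1 weights every class equally regardless of its size, a poorly separated group lowers the average as much as a well-separated one, which is why per-class and macro scores are reported side by side.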