In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize speech dialects. Our motivation stems from the observation that individuals with Alzheimer's Disease (AD) or mild cognitive impairment (MCI) often exhibit measurable speech characteristics, such as a slower articulation rate and lengthened sounds, in a manner similar to dialectal phonetic variation. Building on this idea, we introduce VoxCog, an end-to-end framework that uses pre-trained dialect models to detect AD or MCI without relying on additional modalities such as text or images. Through experiments on multiple multilingual datasets for AD and MCI detection, we demonstrate that initializing with a dialect classifier on top of speech foundation models consistently improves predictive performance for AD and MCI. Our trained models achieve comparable or often better performance than previous approaches that ensemble several computational methods across different signal modalities. In particular, our end-to-end speech-based model achieves 87.5% and 85.9% accuracy on the ADReSS 2020 and ADReSSo 2021 challenge test sets, respectively, outperforming existing solutions that use multimodal ensembles or LLMs.