A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important because timely intervention may help slow cognitive decline and improve patient care. Recent studies have shown that spontaneous speech carries linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation, ensemble methods, or attention-based fusion mechanisms, none of which explicitly maximizes the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information and is trainable end to end. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling aggregates the frame-level acoustic embeddings. For the textual modality, transcripts are encoded with a pre-trained BERT model, whose [CLS] token representation serves as the linguistic embedding. The acoustic and textual representations are then combined by an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a Mutual Information Neural Estimation (MINE) objective that maximizes the mutual information between the two modalities and improves multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments on the publicly available ADReSS Challenge and PROCESS-2 datasets demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
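
To make two of the components named above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that attentive statistics pooling scores each frame with a single learned vector (the hypothetical `attn_vector`; a real model would more likely use a small scoring network) and that the MINE objective takes the Donsker-Varadhan form, in which a statistics network scores matched audio-text pairs (`joint_scores`) and mismatched pairs obtained by shuffling one modality (`marginal_scores`). The function names and the pure-Python data layout are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, attn_vector):
    """Pool T frame embeddings of size D into one vector of size 2*D.

    frames: list of T lists, each of length D (e.g. HuBERT frame outputs).
    attn_vector: hypothetical learned scoring vector of length D.
    Returns the attention-weighted mean concatenated with the
    attention-weighted standard deviation.
    """
    # One scalar relevance score per frame, normalised over time.
    scores = [sum(w * x for w, x in zip(attn_vector, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    second = [sum(a * f[d] ** 2 for a, f in zip(alphas, frames)) for d in range(dim)]
    # Clamp the variance to avoid a negative value from rounding error.
    std = [math.sqrt(max(second[d] - mean[d] ** 2, 1e-12)) for d in range(dim)]
    return mean + std


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].

    Maximising this quantity with respect to the statistics network and
    the encoders tightens a lower bound on the mutual information.
    """
    joint_term = sum(joint_scores) / len(joint_scores)
    peak = max(marginal_scores)
    mean_exp = sum(math.exp(s - peak) for s in marginal_scores) / len(marginal_scores)
    return joint_term - (peak + math.log(mean_exp))
```

In a trainable version of this sketch, the negative of `dv_lower_bound` would be added to the classification loss, so that minimizing the total loss increases the estimated mutual information between the audio and text embeddings.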
