Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but staging these diseases remains challenging because heterogeneous data are difficult to integrate. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients (MFCCs) of the audio data with vision transformers and projects the resulting representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens into a single sequence. The LLM is then instruction-tuned with Low-Rank Adaptation (LoRA) on task prompts to autoregressively predict a constrained label token, enabling generative classification. On the Bridge2AI-Voice dataset, for fine-grained staging of AD and PD, NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. These results indicate that multimodal LLMs hold considerable promise for neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.
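The pipeline summarized in the abstract (project audio features into the LLM embedding space, concatenate them with text tokens, then predict a label token from a restricted set) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy dimensions, the projection weights, and the stage labels `stage_1` to `stage_3` are all hypothetical, and the vision transformer encoders, the LLM, and the LoRA fine-tuning are omitted entirely.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def project(features: Sequence[Vector], weight: Sequence[Vector]) -> List[Vector]:
    """Linearly map each encoder feature (length d_in) to the LLM width.

    `weight` has shape d_llm x d_in; this stands in for a learned projector.
    """
    return [[sum(w * x for w, x in zip(row, feat)) for row in weight]
            for feat in features]


def build_sequence(spec_tokens: Sequence[Vector],
                   mfcc_tokens: Sequence[Vector],
                   text_tokens: Sequence[Vector]) -> List[Vector]:
    """Concatenate projected audio tokens and text tokens into one sequence."""
    return list(spec_tokens) + list(mfcc_tokens) + list(text_tokens)


def constrained_label(logits: Sequence[float],
                      label_token_ids: Dict[str, int]) -> str:
    """Return the label whose token has the highest logit.

    All vocabulary entries outside `label_token_ids` are ignored, so the
    output is always a valid label (assumed single-token labels).
    """
    return max(label_token_ids, key=lambda name: logits[label_token_ids[name]])


def label_probabilities(logits: Sequence[float],
                        label_token_ids: Dict[str, int]) -> Dict[str, float]:
    """Softmax restricted to the label tokens only."""
    peak = max(logits[i] for i in label_token_ids.values())
    exps = {name: math.exp(logits[i] - peak)
            for name, i in label_token_ids.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


if __name__ == "__main__":
    # Hypothetical projector: encoder width 2 -> LLM width 3.
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    spec = project([[1.0, 0.0], [0.0, 1.0]], weight)   # two spectrogram patches
    mfcc = project([[1.0, 1.0]], weight)               # one MFCC patch
    text = [[0.5, 0.5, 0.5]]                           # one text-token embedding
    sequence = build_sequence(spec, mfcc, text)
    print(len(sequence), "tokens of width", len(sequence[0]))

    # Toy next-token logits over a 5-entry vocabulary (hypothetical values).
    logits = [0.1, 2.0, 5.0, 1.5, 0.3]
    labels = {"stage_1": 1, "stage_2": 3, "stage_3": 4}
    print(constrained_label(logits, labels))
```

In the toy logits, the unconstrained argmax is vocabulary entry 2, which is not a label token; restricting the choice to the label set is what guarantees that the generated output is always a valid stage.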