Speech-based analysis offers a scalable, non-invasive approach to detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on the automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 participants with mild cognitive impairment, and 50 participants with a dementia diagnosis, collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session comprising picture description and verbal fluency tasks; the recordings are accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical validation covered demographic balance, clinical consistency, recording stability, embedding-space structure, and reproducible baseline modelling performance. The results demonstrate clinically meaningful group separation and stable performance across modelling approaches, while the data preserve real-world conversational variability. PROCESS-2 is released under controlled access via Hugging Face to enable responsible reuse while protecting participant privacy, providing a reproducible benchmark resource for speech-based cognitive assessment research.