Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent, clinically grounded screening approaches.

Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation that combines lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits, and subject-level five-fold cross-validation, which prevents speaker overlap between training and test data. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff's delta effect sizes.

Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for the POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in function word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with the machine learning feature importance findings.

Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
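
As an illustration of two steps named in the Methods, the sketch below shows one way to build subject-level folds so that no speaker appears in more than one fold, and one way to compute the Mann-Whitney U statistic together with Cliff's delta. It is a minimal pure-Python sketch under stated assumptions, not the study's implementation: the function names, the largest-first fold assignment, and the tie handling (ties count 0.5 toward U) are all choices made here for illustration. It does not compute a p-value; a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would normally be used for that.

```python
from collections import defaultdict


def subject_level_folds(subject_ids, n_folds=5):
    """Assign transcript indices to folds so that no subject spans two folds.

    `subject_ids[i]` is the (hypothetical) speaker ID of transcript i.
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place subjects with the most transcripts first, each into the currently
    # smallest fold, to keep fold sizes roughly balanced (an assumption here).
    ordered = sorted(by_subject.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _sid, indices in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(indices)
    return folds


def mann_whitney_u(x, y):
    """U statistic for x versus y: each pair with x > y adds 1, each tie adds 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def cliffs_delta(x, y):
    """Cliff's delta = P(x > y) - P(x < y), which equals 2U / (n * m) - 1."""
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0
```

Given a per-transcript feature value (for example, a function word rate) for the two groups, `cliffs_delta(group_a, group_b)` returns a value in [-1, 1]; values near zero indicate heavy overlap between groups, and values near either extreme indicate that one group's values are almost always larger.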
