Reliable early detection of Alzheimer's disease (AD) is challenging, particularly because labeled data are scarce. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we empirically evaluate several model architectures on three heterogeneous transcript corpora (Pitt, CCC, ADRC) to investigate their effectiveness for text-based AD detection and to analyze how task-relevant information is encoded in their internal representations. To the best of our knowledge, our fine-tuned BERT and T5 models establish a new state of the art on the Pitt and CCC datasets while achieving strong performance on ADRC. The decoder-only Llama-1B achieves results comparable to BERT and T5 across all three corpora, highlighting its effectiveness for AD detection. We further conduct a comprehensive evaluation of the Llama-1B backbone, analyzing cross-corpus transferability, optimal input chunk size, and the impact of clinical transcript markers. Finally, we use linear probing to show empirically that fine-tuning shifts the representations of individual tokens, both linguistic markers and content words, in ways that reflect AD-related signal.