Objective: This study has two goals. First, it aims to improve the understanding of how dementia of the Alzheimer's disease (AD) type affects different aspects of the lexicon. Second, it aims to demonstrate that these aspects of the lexicon, when used as features of a machine learning classifier, can help achieve state-of-the-art performance in automatically identifying language samples produced by patients with AD. Methods: Data are derived from the ADDreSS challenge, which is part of the DementiaBank corpus. The dataset consists of transcripts of Cookie Theft picture descriptions, produced by 54 subjects in the training part and 24 subjects in the test part. The number of narrative samples is 108 in the training set and 48 in the test set. First, the impact of AD on 99 selected lexical features was studied using both the training and test parts of the dataset. Then, machine learning experiments were conducted on the task of distinguishing transcribed speech samples produced by people with AD from those produced by healthy control subjects. Several experiments were conducted to compare the different areas of lexical complexity, identify the subset of features that helps achieve optimal performance, and study the impact of input size on classification. To evaluate how well models built on narrative speech generalize, two generalization tests were conducted using written data from two British authors, Iris Murdoch and Agatha Christie, and transcriptions of speeches by former U.S. President Ronald Reagan. Results: Using lexical features only, state-of-the-art classification performance, with F1 scores and accuracies of over 91%, was achieved in distinguishing language samples produced by individuals with AD from those produced by healthy control subjects. This confirms the substantial impact of AD on lexical processing.
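To make the idea of lexical features as classifier inputs concrete, the following is a minimal sketch and not the study's implementation. The abstract does not list the 99 features, so the three measures used here (type-token ratio, hapax ratio, mean word length) are assumptions chosen because they are common lexical-richness measures. The function names `lexical_features` and `accuracy_and_f1` and the simple regex tokenizer are likewise hypothetical.

```python
import re
from collections import Counter


def lexical_features(transcript):
    """Compute a few illustrative lexical measures for one transcript.

    These measures are assumptions for illustration; they are not
    claimed to be among the 99 features used in the study.
    """
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return {"n_tokens": 0.0, "type_token_ratio": 0.0,
                "hapax_ratio": 0.0, "mean_word_length": 0.0}
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "n_tokens": float(n),
        # Distinct words divided by total words.
        "type_token_ratio": len(counts) / n,
        # Words occurring exactly once, divided by total words.
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n,
        "mean_word_length": sum(len(t) for t in tokens) / n,
    }


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as AD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Made-up sentence, not taken from the dataset.
    print(lexical_features("the boy takes the cookie"))
    # Made-up labels: 1 = AD, 0 = control.
    print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a setup like the one described, each transcript would be mapped to such a feature vector, a classifier would be trained on the training transcripts, and accuracy and F1 would be reported on the held-out test transcripts.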