Automatic Scoring of Cognition Drawings: Assessing the Quality of Machine-Based Scores Against a Gold Standard

Figure drawing is often used as part of dementia screening protocols. The Survey of Health, Ageing and Retirement in Europe (SHARE) has adopted three drawing tests from Addenbrooke's Cognitive Examination III as part of its questionnaire module on cognition. While such drawings are usually scored by trained clinicians, SHARE relies on the interviewers who conduct the face-to-face interviews to score the drawings during fieldwork. This may pose a risk to data quality, because interviewers lack clinical training and may therefore score less consistently and make more errors. This paper reports a first proof of concept and evaluates the feasibility of automating the scoring using deep learning. We train several convolutional neural network (CNN) models on about 2,000 drawings from the 8th wave of the SHARE panel in Germany, using both the corresponding interviewer scores and self-developed 'gold standard' scores as labels. The results suggest that this approach is indeed feasible. Compared to training on interviewer scores, training on the gold standard data improves prediction accuracy by about 10 percentage points. The best-performing model, ConvNeXt Base, achieves an accuracy of about 85%, which is 5 percentage points higher than the accuracy of the interviewers. While this is a promising result, the models still struggle to score partially correct drawings, which are also problematic for interviewers. This suggests that more and better training data are needed to achieve production-level prediction accuracy. We therefore discuss possible next steps to improve the quality and quantity of training examples.
