Automatic Scoring of Cognition Drawings: Assessing the Quality of Machine-Based Scores Against a Gold Standard

Figure drawing is often used as part of dementia screening protocols. The Survey of Health, Ageing and Retirement in Europe (SHARE) has adopted three drawing tests from Addenbrooke's Cognitive Examination III as part of its questionnaire module on cognition. Whereas such drawings are usually scored by trained clinicians, SHARE relies on its face-to-face interviewers to score them during fieldwork. This may pose a risk to data quality, because interviewers lack clinical training and may therefore score less consistently and make more errors. This paper reports a first proof of concept and evaluates the feasibility of automating the scoring with deep learning. We train several convolutional neural network (CNN) models on about 2,000 drawings from the 8th wave of the SHARE panel in Germany, using both the corresponding interviewer scores and self-developed 'gold standard' scores as labels. The results suggest that this approach is feasible. Compared to training on interviewer scores, training on the gold standard data improves prediction accuracy by about 10 percentage points. The best-performing model, ConvNeXt Base, achieves an accuracy of about 85%, which is 5 percentage points higher than the accuracy of the interviewers. Although this is a promising result, the models still struggle to score partially correct drawings, which are also problematic for interviewers. This suggests that more and better training data are needed to achieve production-level prediction accuracy. We therefore discuss possible next steps to improve the quality and quantity of training examples.
