Purpose: Most studies evaluating artificial intelligence (AI) models that detect abnormalities in neuroimaging either test the models on unrepresentative patient cohorts or validate them insufficiently, leading to poor generalisability to real-world tasks. The aim was to determine the diagnostic test accuracy of AI models performing first-line, high-volume neuroimaging tasks and to summarise the evidence supporting their use. Methods: Medline, Embase, Cochrane Library and Web of Science were searched until September 2021 for studies that temporally or externally validated AI models capable of detecting abnormalities in first-line CT or MR neuroimaging. A bivariate random-effects model was used for meta-analysis where appropriate. PROSPERO: CRD42021269563. Results: Only 16 studies were eligible for inclusion. Included studies were not compromised by unrepresentative datasets or inadequate validation methodology. Direct comparison with radiologists was available in 4/16 studies. 15/16 studies had a high risk of bias. Meta-analysis was only suitable for intracranial haemorrhage detection in CT imaging (10/16 studies), where AI systems had a pooled sensitivity and specificity of 0.90 (95% CI 0.85-0.94) and 0.90 (95% CI 0.83-0.95), respectively. The remaining AI studies, using CT and MRI, detected target conditions other than haemorrhage (2/16) or multiple target conditions (4/16). Only 3/16 studies implemented AI in clinical pathways, either for pre-read triage or as post-read discrepancy identifiers. Conclusion: The paucity of eligible studies reflects that most abnormality detection AI studies were not adequately validated in representative clinical cohorts. The few studies describing how abnormality detection AI could impact patients and clinicians did not explore the full ramifications of clinical implementation.
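For readers less familiar with diagnostic test accuracy, the sketch below shows how sensitivity and specificity are derived from one study's 2x2 confusion table, with a Wilson score interval for each. It is a per-study illustration only: it does not implement the bivariate random-effects model used for pooling, the function names are hypothetical, and the counts are invented for the example rather than taken from any included study.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity, each with a Wilson interval, from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),           # diseased cases correctly flagged
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),           # non-diseased cases correctly cleared
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Invented counts for illustration only; not data from the review.
result = diagnostic_accuracy(tp=80, fp=30, fn=20, tn=170)
print(result)
```

Pooling across studies requires more than averaging these per-study values, because sensitivity and specificity are typically correlated between studies; that correlation is what a bivariate random-effects model is designed to account for.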