Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automate this process. However, most existing AI algorithms rely on large numbers of expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans covering different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0\% using only 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The best-performing algorithms successfully generalized to holdout external validation sets, achieving median DSCs of 89.5\%, 90.9\%, and 88.3\% on North American, European, and Asian cohorts, respectively. They also enabled automatic extraction of key organ biological features, which is labor-intensive with traditional manual measurements. Together, these results show the potential of unlabeled data to boost performance and alleviate annotation shortages for modern AI models.
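For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Dice Similarity Coefficient and its median across cases can be computed. It assumes the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), applied to binary masks. The function name `dice_similarity`, the flattened-list mask representation, the empty-mask convention, and the toy masks are illustrative assumptions; this is not the challenge's evaluation code.

```python
from statistics import median


def dice_similarity(pred, truth):
    """DSC between two binary masks given as flat sequences of 0/1 voxels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as organ in both the prediction and the reference.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the two mask sizes (the denominator of the Dice formula).
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks (not challenge data): (prediction, reference) pairs.
cases = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # identical masks
    ([1, 1, 1, 0], [1, 1, 0, 0]),  # partial overlap: 2*2 / (3+2)
    ([1, 0, 0, 0], [0, 1, 0, 0]),  # disjoint masks
]
scores = [dice_similarity(p, t) for p, t in cases]
median_dsc = median(scores)
```

Reporting the median rather than the mean, as the abstract does, makes the summary less sensitive to a small number of failed cases with very low overlap.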