The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment

Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, Daniel Neimark, Omri Bar, Amine Yamlahi, Jakob Hennighausen, Xiaohan Wang, Rui Li, Long Liang, Yuxian Wang, Saurabh Koju, Binod Bhattarai, Tim Jaspers, Zhehua Mao, Anjana Wijekoon, Jun Ma, Yinan Xu, Zhilong Weng, Ammar M. Okran, Hatem A. Rashwan, Boyang Shen, Kaixiang Yang, Yutao Zhang, Hao Wang, 2024 CVS Challenge Consortium, Quanzheng Li, Filippo Filicori, Xiang Li, Pietro Mascagni, Daniel A. Hashimoto, Guy Rosman, Ozanan Meireles, Nicolas Padoy

from arXiv, 21 pages, 10 figures

Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17% relative gain in assessment performance, a reduction in calibration error of over 80%, and a 17% relative improvement in robustness over the state of the art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.
