A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites (specifically, those requiring visual recognition). Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
