Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction

Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a reduction of more than 8.9 points in SNOT-22 at 6 months, the minimal clinically important difference (MCID). In a prospectively collected cohort in which all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would go on to have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychological/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
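
The success definition and the decision-curve metric named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the authors' implementation, so this is only an assumption-laden illustration: the function names `achieved_mcid` and `net_benefit` are hypothetical, and the net-benefit expression is the standard decision-curve formula, TP/n - (FP/n) * pt / (1 - pt), assumed here rather than taken from the paper.

```python
MCID = 8.9  # SNOT-22 points; threshold stated in the abstract


def achieved_mcid(baseline, six_month):
    """Hypothetical helper: True when SNOT-22 fell by more than the MCID."""
    return (baseline - six_month) > MCID


def net_benefit(y_true, y_prob, threshold):
    """Standard decision-curve net benefit at a probability threshold.

    y_true: iterable of 0/1 outcomes (1 = achieved MCID).
    y_prob: predicted probabilities of achieving MCID.
    threshold: probability cut-off in (0, 1) above which surgery is recommended.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pairs = list(zip(y_true, y_prob))
    n = len(pairs)
    # Patients recommended for surgery who did improve (true positives)
    tp = sum(1 for y, p in pairs if p >= threshold and y)
    # Patients recommended for surgery who did not improve (false positives)
    fp = sum(1 for y, p in pairs if p >= threshold and not y)
    # False positives are weighted by the odds of the threshold
    return tp / n - (fp / n) * threshold / (1.0 - threshold)
```

For example, a patient whose SNOT-22 score goes from 50 to 40 counts as a success (a 10-point reduction), while a drop from 50 to 42 does not. Plotting `net_benefit` over a range of thresholds for each model gives the decision curve on which the abstract compares ML and GenAI.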
