Who Benefits From Sinus Surgery? Comparing Generative AI and Supervised Machine Learning for Predicting Surgical Outcomes in Chronic Rhinosinusitis

Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a reduction of more than 8.9 points in SNOT-22 at 6 months, the minimal clinically important difference (MCID). In a prospectively collected cohort in which all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e., those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
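
As a minimal sketch of the two quantities the abstract relies on, the snippet below derives the binary success label from the stated threshold (a SNOT-22 reduction of more than 8.9 points) and computes decision-curve net benefit in its standard form, TP/n − (FP/n) · p_t/(1 − p_t). The function names `mcid_success` and `net_benefit` are hypothetical, and whether the study computes net benefit in exactly this way is an assumption.

```python
def mcid_success(baseline, followup, mcid=8.9):
    """Return True when SNOT-22 fell by more than the MCID (8.9 points per the abstract)."""
    return (baseline - followup) > mcid


def net_benefit(y_true, y_prob, threshold):
    """Standard decision-curve net benefit at one probability threshold (0 < threshold < 1).

    y_true: iterable of 0/1 outcomes (1 = clinically meaningful improvement).
    y_prob: iterable of predicted probabilities of improvement.
    A patient is recommended for surgery when the prediction meets the threshold.
    """
    y_true = list(y_true)
    y_prob = list(y_prob)
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and not y)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

Evaluating `net_benefit` over a range of thresholds and comparing the result with the "operate on everyone" and "operate on no one" strategies yields a decision curve; the data passed in here would be illustrative only, not the study's cohort.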
