Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations: factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework that quantifies hallucination risk by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs on 30 expert-validated spinal cases. DeepSeek-R1 demonstrated the strongest overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. Notably, reasoning-enhanced model variants did not uniformly outperform their standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating that extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities: under amplified case complexity, recommendation quality degraded by 7.4% even as rationality (+2.0%), readability (+1.7%), and diagnosis (+4.7%) improved marginally, highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning-chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
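The abstract does not specify how the five evaluation dimensions are aggregated into the reported total scores. The sketch below is a minimal illustration only, assuming an equally weighted 0–100 rating per dimension and per-case expert ratings; the dimension names, the helper functions (`case_score`, `model_summary`, `stress_delta`), and all numeric values are hypothetical and not taken from the paper.

```python
from statistics import mean, stdev

# Hypothetical dimension names; the paper's actual rubric and weights are
# not given in the abstract, so equal weighting on a 0-100 scale is assumed.
DIMENSIONS = ["diagnosis", "recommendation", "rationality", "readability", "knowledge"]


def case_score(ratings: dict[str, float]) -> float:
    """Aggregate one case's per-dimension ratings into a single total score."""
    return mean(ratings[d] for d in DIMENSIONS)


def model_summary(cases: list[dict[str, float]]) -> tuple[float, float]:
    """Return mean and standard deviation of total scores across cases."""
    totals = [case_score(c) for c in cases]
    return mean(totals), stdev(totals)


def stress_delta(baseline: float, stressed: float) -> float:
    """Percent change in a dimension score under amplified case complexity."""
    return 100.0 * (stressed - baseline) / baseline


if __name__ == "__main__":
    # Toy ratings for two cases of one model (illustrative numbers only).
    cases = [
        {"diagnosis": 88, "recommendation": 84, "rationality": 86,
         "readability": 90, "knowledge": 85},
        {"diagnosis": 85, "recommendation": 80, "rationality": 88,
         "readability": 89, "knowledge": 84},
    ]
    m, s = model_summary(cases)
    print(f"total score: {m:.2f} +/- {s:.2f}")
    # A drop from 84.0 to 77.8 reproduces a ~-7.4% change, matching the
    # magnitude the abstract reports for recommendation quality.
    print(f"recommendation under stress: {stress_delta(84.0, 77.8):+.1f}%")
```

The `stress_delta` call merely checks the percent-change formula against the reported figure; the actual stress-testing protocol and rating procedure would follow the paper's methodology.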

