Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, i.e., factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding is that reasoning-enhanced model variants did not uniformly outperform their standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating that extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%), and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings argue for integrating interpretability mechanisms (e.g., reasoning-chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
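For readers who wish to see how the reported aggregates relate to per-case scores, the minimal sketch below illustrates, under assumed data structures, how dimension scores could be summarized as mean $\pm$ SD and how stress-test percentage changes (such as the 7.4% drop in recommendation quality) are computed. The dimension labels, case counts, and numeric values are illustrative placeholders, not the study's actual data or scoring pipeline.

```python
from statistics import mean, stdev

# Hypothetical per-case scores for one model on four evaluation dimensions.
# Values are illustrative placeholders only.
baseline = {
    "diagnosis":      [17.0, 16.5, 17.5, 16.0],
    "recommendation": [16.2, 15.8, 16.5, 15.9],
    "rationality":    [17.1, 16.8, 17.3, 16.9],
    "readability":    [17.5, 17.2, 17.8, 17.0],
}
stressed = {  # the same cases re-scored under amplified complexity
    "diagnosis":      [17.6, 17.3, 18.2, 17.0],
    "recommendation": [15.0, 14.6, 15.3, 14.8],
    "rationality":    [17.4, 17.1, 17.6, 17.3],
    "readability":    [17.8, 17.5, 18.0, 17.4],
}

def summarize(scores):
    """Return (mean, sample SD) for a list of per-case scores."""
    return mean(scores), stdev(scores)

def pct_change(before, after):
    """Percentage change of the stressed mean relative to the baseline mean."""
    return 100.0 * (mean(after) - mean(before)) / mean(before)

for dim in baseline:
    m, s = summarize(baseline[dim])
    print(f"{dim:14s} baseline {m:.2f} ± {s:.2f}  "
          f"change under stress {pct_change(baseline[dim], stressed[dim]):+.1f}%")
```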