Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies.

Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval, whereas the comparator LLMs had real-time web access to guidelines and primary literature.

Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding the human reference accuracy of 62.3% and all frontier LLMs tested: GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy <50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Mirror also provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification.

Conclusions: Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
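The abstract does not state which method was used for the 95% confidence interval, but the reported bounds (80.4-92.3% for 105/120) are consistent with a Wilson score interval at z = 1.96. A minimal sketch, assuming that method:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n.

    Unlike the normal (Wald) approximation, the Wilson interval
    remains well-behaved for proportions near 0 or 1.
    """
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 105 correct out of 120 questions, as reported for Mirror
lo, hi = wilson_ci(105, 120)
print(f"{lo:.1%} - {hi:.1%}")  # → 80.4% - 92.3%
```

The Wald interval would give roughly 81.6-93.4% here, so the match to the reported bounds suggests the Wilson form; this remains an inference, not a statement from the paper.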