Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves by 6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.