Aligning large language models (LLMs) with high-stakes medical standards remains difficult, primarily because coarse-grained preference signals fail to capture the complex, multi-dimensional nature of clinical protocols. To bridge this gap, we introduce ProMedical, a unified alignment framework grounded in fine-grained clinical criteria. We first construct ProMedical-Preference-50k, a dataset generated via a human-in-the-loop pipeline that augments medical instructions with rigorous, physician-derived rubrics. Using this corpus, we propose the Explicit Criteria Injection paradigm to train a multi-dimensional reward model, ProMedical-RM. Unlike traditional scalar reward models, our approach explicitly disentangles safety constraints from general proficiency, enabling precise guidance during reinforcement learning. To validate the framework, we establish ProMedical-Bench, a held-out evaluation suite anchored by double-blind expert adjudication. Empirically, optimizing the Qwen3-8B base model with ProMedical-RM-guided Group Relative Policy Optimization (GRPO) improves overall accuracy by 22.3% and safety compliance by 21.7%, rivaling proprietary frontier models. The aligned policy also generalizes to external benchmarks, performing comparably to state-of-the-art models on UltraMedical. We publicly release our datasets, reward models, and benchmarks to facilitate reproducible research in safety-aware medical alignment.
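To make the idea of a disentangled reward guiding GRPO more concrete, the following is a minimal sketch and not the authors' implementation. The `RubricScores` type, the `combine` rule, the `safety_floor` threshold, and the equal weighting are all hypothetical choices made for illustration; the abstract does not specify how the reward dimensions are aggregated. The only part taken from the standard GRPO formulation is the group-relative normalization of rewards across responses sampled for the same prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RubricScores:
    """Hypothetical per-response scores in [0, 1] from a two-dimensional reward model."""
    safety: float
    proficiency: float


def combine(scores: RubricScores, safety_floor: float = 0.5) -> float:
    """Collapse the two dimensions into one scalar reward.

    Assumption for illustration: a response below the safety floor receives a
    negative reward regardless of proficiency, so fluency cannot offset a
    safety violation. Otherwise the two dimensions are averaged.
    """
    if scores.safety < safety_floor:
        return scores.safety - 1.0
    return 0.5 * scores.safety + 0.5 * scores.proficiency


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampling group.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one prompt (made-up scores).
group = [
    RubricScores(safety=0.9, proficiency=0.8),
    RubricScores(safety=0.2, proficiency=0.95),  # proficient but unsafe
    RubricScores(safety=0.7, proficiency=0.6),
]
rewards = [combine(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the second response ends up with the lowest advantage in its group despite having the highest proficiency score, which is the kind of behavior a single scalar preference signal cannot guarantee.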