While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.
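As a rough illustration of the two-stage idea described above, the sketch below pseudo-labels synthetic questions with the policy model's own samples and then sets up the two training stages. The abstract does not state how pseudo-labels are produced, so the majority-vote rule, the agreement threshold, the exact-match reward, and all function names here (`pseudo_label`, `reward`, `build_two_stage_schedule`, `sample_fn`) are assumptions for illustration only, not the authors' implementation; the released code is the authoritative reference.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def pseudo_label(
    sampled_answers: List[str], min_agreement: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Assumed rule: majority vote over answers sampled from the policy model.

    Returns (answer, agreement), or None when the samples are empty or
    agree too weakly for the label to be trusted.
    """
    if not sampled_answers:
        return None
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    agreement = votes / len(sampled_answers)
    if agreement < min_agreement:
        return None
    return answer, agreement


def reward(predicted: str, target: str) -> float:
    """Assumed exact-match reward, used with pseudo-labels or human labels."""
    return 1.0 if predicted.strip() == target.strip() else 0.0


def build_two_stage_schedule(
    synthetic_questions: Iterable[str],
    sample_fn: Callable[[str], str],  # hypothetical: draws one answer from the policy
    real_data: Iterable[Tuple[str, str]],
    num_samples: int = 8,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Stage 1: pseudo-labeled synthetic data; stage 2: human-annotated data."""
    stage1 = []
    for question in synthetic_questions:
        answers = [sample_fn(question) for _ in range(num_samples)]
        labeled = pseudo_label(answers)
        if labeled is not None:  # drop questions without a confident label
            stage1.append((question, labeled[0]))
    stage2 = list(real_data)
    return stage1, stage2
```

Under these assumptions, an RL trainer would first optimize `reward` against the `stage1` pseudo-labels and then against the `stage2` human labels, so no reasoning traces need to be distilled from a larger model.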