We present Information Gain Fine-Tuning (IGFT), a novel approach for training medical conversational AI to conduct effective patient interviews and generate a comprehensive History of Present Illness (HPI) without requiring pre-collected human conversations. IGFT combines online Group Relative Policy Optimization (GRPO) with information-theoretic rewards, enabling models to learn from self-generated conversations with simulated patients. Unlike existing approaches that rely on expensive expert-annotated conversations or static datasets, our online RL framework allows models to discover effective questioning strategies through exploration. Our key innovation is an information gain reward function that tracks which clinical entities, such as symptoms, temporal patterns, and medical history, are revealed during conversation. Each question's reward is computed from its expected information gain combined with GPT-4o-mini quality assessments across dimensions including clinical relevance, patient engagement, and specificity. This hybrid approach ensures models learn to ask targeted, clinically appropriate questions that efficiently gather diagnostic information. We fine-tune two models using LoRA: Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Qwen-7B (a reasoning-optimized model). Training exclusively on Avey data containing concise HPIs, we evaluate generalization to MIMIC data with longer, more elaborate HPIs. DeepSeek-R1-Distill-Qwen-7B (IGFT) achieves F1 scores of 0.408 on Avey (a 10.9% improvement over the base model) and 0.289 on MIMIC (a 12.9% improvement), while Llama-3.1-8B-Instruct (IGFT) reaches 0.384 and 0.336, respectively. Both models outperform OpenAI's model on MIMIC and surpass medical domain-specific baselines like HuatuoGPT and UltraMedical, which were optimized for single-turn medical QA rather than multi-turn conversations.
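To make the hybrid reward concrete, the following is a minimal sketch of an information-gain reward in the spirit of the abstract: it tracks which clinical entities a conversation has revealed so far and blends the per-turn entity gain with an LLM-judged quality score. The entity extractor, the fixed entity vocabulary, the `alpha` weight, and the helper names are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical vocabulary of clinical entities the interviewer should elicit
# (symptoms, temporal patterns, medical history). The real system would use
# a proper clinical entity extractor rather than keyword matching.
KNOWN_ENTITIES = {"fever", "cough", "three_days", "hypertension"}

def extract_entities(text: str) -> set[str]:
    """Stand-in for a clinical entity extractor: keyword match against the
    assumed vocabulary. Purely illustrative."""
    tokens = {tok.strip(".,").lower() for tok in text.split()}
    return tokens & KNOWN_ENTITIES

def information_gain_reward(revealed: set[str],
                            patient_reply: str,
                            quality_score: float,
                            alpha: float = 0.7) -> float:
    """Reward for one question: normalized count of newly revealed entities,
    blended with an LLM quality score in [0, 1]. `alpha` is an assumed
    mixing weight, not a value reported in the paper."""
    new_entities = extract_entities(patient_reply) - revealed
    gain = len(new_entities) / max(len(KNOWN_ENTITIES), 1)
    revealed |= new_entities  # update the conversation's entity state
    return alpha * gain + (1 - alpha) * quality_score

# One simulated turn: the patient's reply reveals two of four entities.
revealed: set[str] = set()
reward = information_gain_reward(revealed, "I have a fever and cough", 0.8)
```

Repeating a question yields zero entity gain on later turns, so under this reward the policy is pushed toward questions that uncover new diagnostic information, while the quality term penalizes clinically irrelevant or unengaging phrasings even when they happen to elicit new entities.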