Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple alignment cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside trade-offs with erroneous refusals that expose architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
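For context, the two preference objectives named above are standard; the sketch below is a minimal PyTorch rendering of them, assuming precomputed per-sequence log-probabilities under the policy and a frozen reference model. The tensor names, the omitted per-class weights, and the batch-level KTO reference point are illustrative simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps: torch.Tensor, pi_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    # Implicit rewards are the policy/reference log-ratios for each completion.
    chosen_rewards = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (pi_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def kto_loss(pi_logps: torch.Tensor, ref_logps: torch.Tensor,
             desirable: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Simplified Kahneman-Tversky Optimization loss (Ethayarajh et al., 2024).

    `desirable` is a boolean mask: True for safe/helpful completions, False
    for unsafe ones. KTO needs only this binary signal, not preference pairs.
    """
    rewards = beta * (pi_logps - ref_logps)
    # Reference point: a crude batch-level estimate of the policy/reference
    # divergence (the original method estimates it from mismatched pairs).
    z_ref = rewards.detach().mean().clamp(min=0)
    # Push desirable examples above the reference point, undesirable below it.
    values = torch.where(desirable,
                         torch.sigmoid(rewards - z_ref),
                         torch.sigmoid(z_ref - rewards))
    return (1.0 - values).mean()
```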