Synthetic data generation with Large Language Models (LLMs) has emerged as a promising solution in the medical domain to mitigate data scarcity and privacy constraints. However, existing approaches remain constrained by their derivative nature: they rely on real-world records, which pose privacy risks and introduce distribution biases. Furthermore, current patient agents face the Stability-Plasticity Dilemma, struggling to maintain clinical consistency during dynamic inquiries. To address these challenges, we introduce Patient-Zero, a novel framework for ab initio patient simulation that requires no real medical records. Our Medically-Aligned Hierarchical Synthesis framework generates comprehensive and diverse patient records from abstract clinical guidelines via stratified attribute permutation. To support rigorous clinical interaction, we design a Dual-Track Cognitive Memory System that enables agents to dynamically update memory while preserving logical consistency and persona adherence. Extensive evaluations show that Patient-Zero establishes a new state-of-the-art in both data quality and interaction fidelity. In human expert evaluations, senior licensed physicians judge our synthetic data to be statistically indistinguishable from real human-authored data and higher in clinical quality. Furthermore, a downstream medical reasoning model trained on our synthetic dataset shows substantial performance gains (MedQA +24.0%; MMLU +14.5%), demonstrating the practical utility of our framework.