Machine learning (ML) holds great promise for clinical applications, but progress is often hindered by limited access to high-quality data, owing to privacy concerns and to the high costs and long timelines of clinical trials. While large language models (LLMs) have demonstrated strong performance in general-purpose generation tasks, their application to synthesizing realistic clinical trial reports remains underexplored. In this work, we propose a novel Retrieval-Reasoning framework that uses few-shot prompting with LLMs to generate synthetic clinical trial reports annotated with binary success/failure outcomes. Our approach integrates a retrieval module that grounds generation in relevant trial data and a reasoning module that ensures domain-consistent justifications. Experiments on real clinical trials from the ClinicalTrials.gov database demonstrate that the generated synthetic trials effectively augment real datasets. We fine-tune a BioBERT classifier on synthetic data, real data, or their combination, and find that hybrid fine-tuning improves performance on clinical trial outcome prediction tasks. Our results suggest that LLM-based synthetic data can serve as a powerful tool for privacy-preserving data augmentation in clinical research. The code is available at https://github.com/XuZR3x/Retrieval_Reasoning_Clinical_Trial_Generation.
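As a rough illustration of the retrieval-then-prompt idea described in the abstract, the sketch below retrieves the most similar real trials and assembles them into a few-shot prompt that asks for a synthetic report with a stated outcome and a justification. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Trial` record, the Jaccard token-overlap retriever, the prompt wording, and the toy corpus are all hypothetical, and the LLM call itself is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record; field names are illustrative, not the paper's schema."""
    summary: str
    outcome: str  # "success" or "failure"


def tokenize(text: str) -> set[str]:
    # Lowercased whitespace tokens (assumption: a real system may use embeddings).
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[Trial], k: int = 2) -> list[Trial]:
    # Rank real trials by lexical overlap with the query and keep the top k.
    q = tokenize(query)
    ranked = sorted(
        corpus, key=lambda t: jaccard(q, tokenize(t.summary)), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, examples: list[Trial], target_outcome: str) -> str:
    # Few-shot prompt: retrieved trials as demonstrations, then the request.
    shots = "\n\n".join(
        f"Trial: {t.summary}\nOutcome: {t.outcome}" for t in examples
    )
    return (
        f"{shots}\n\n"
        f"Write a new synthetic trial report about: {query}\n"
        "Explain step by step why the outcome is plausible.\n"
        f"Outcome: {target_outcome}"
    )


# Toy corpus standing in for real ClinicalTrials.gov records (invented text).
corpus = [
    Trial("phase 2 trial of drug A in type 2 diabetes", "success"),
    Trial("phase 3 trial of drug B in breast cancer", "failure"),
    Trial("phase 2 trial of drug C in type 1 diabetes", "failure"),
]
examples = retrieve("phase 2 diabetes trial", corpus, k=2)
prompt = build_prompt("phase 2 diabetes trial", examples, "success")
```

The resulting `prompt` string would be sent to an LLM of choice, and the generated reports, each paired with its `target_outcome` label, could then be mixed with real trials when fine-tuning a classifier.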