Synthetic data generation using large language models (LLMs) shows substantial promise for addressing biomedical data challenges and is seeing increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues across different data modalities. We conducted a scoping review following the PRISMA-ScR guidelines, searching PubMed, ACM, Web of Science, and Google Scholar for literature published between 2020 and 2025. A total of 59 studies were included based on their relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured text (78.0\%), tabular data (13.6\%), and multimodal sources (8.4\%). Common generation methods included LLM prompting (74.6\%), fine-tuning (20.3\%), and specialized models (5.1\%). Evaluation practices were heterogeneous, spanning intrinsic metrics (27.1\%), human-in-the-loop assessments (44.1\%), and LLM-based evaluations (13.6\%). Nevertheless, key limitations and barriers persist regarding data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols. Future efforts may focus on developing standardized, transparent evaluation frameworks and expanding accessibility to support effective applications in biomedical research.