Since the introduction of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have advanced considerably, and their language understanding and generation capabilities now affect many industries. These models have the potential to transform medicine, which makes specialized evaluation frameworks necessary to ensure that they are deployed effectively and ethically. This survey reviews the applications of LLMs in healthcare and the evaluation they require, emphasizing that empirical validation is needed before their capabilities can be fully used to improve healthcare outcomes. It is structured to analyze LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by examining the roles of LLMs in different medical applications and describing how they are evaluated on tasks such as clinical applications, medical text data processing, information retrieval, data analysis, medical scientific writing, and educational content generation. The subsequent sections cover the methodologies used in these evaluations, discussing the benchmarks and metrics that assess the models' effectiveness, accuracy, and ethical alignment. Through this survey, we aim to give healthcare professionals, researchers, and policymakers a comprehensive understanding of the strengths and limitations of LLMs in medical applications. By detailing the evaluation processes and the challenges of integrating LLMs into healthcare, the survey seeks to guide the responsible development and deployment of these models, so that their potential is realized while stringent ethical standards are maintained.