Recent advances in generative models, including large language models (LLMs), vision-language models (VLMs), and diffusion models, have accelerated the field of natural language and image processing in medicine and marked a significant paradigm shift in how biomedical models can be developed and deployed. While these models are highly adaptable to new tasks, scaling and evaluating their usage presents new challenges not addressed in previous frameworks. In particular, the ability of these models to produce useful outputs with little to no specialized training data ("zero-" or "few-shot" approaches), together with the open-ended nature of their outputs, necessitates updated guidelines for using and evaluating these models. In response to gaps in standards and best practices for the development of clinical AI tools identified by US Executive Order 14110 and several emerging national networks for clinical AI evaluation, we begin to formalize some of these guidelines by building on the "Minimum information about clinical artificial intelligence modeling" (MI-CLAIM) checklist. The MI-CLAIM checklist, originally developed in 2020, provided a set of six steps with guidelines on the minimum information necessary to encourage transparent, reproducible research for artificial intelligence (AI) in medicine. Here, we propose modifications to the original checklist that highlight differences in training, evaluation, interpretability, and reproducibility of generative models compared to traditional AI models for clinical research. The updated checklist also seeks to clarify cohort selection reporting and adds items on alignment with ethical standards.