Recent developments in large language models (LLMs) have unlocked new opportunities for healthcare, from information synthesis to clinical decision support. These new LLMs are capable not only of modeling language but also of acting as intelligent "agents" that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than being measured against benchmarks that test a model's ability to process clinical data or answer standardized test questions, LLM agents should be assessed on their performance on real-world clinical tasks. These new evaluation frameworks, which we call "Artificial-intelligence Structured Clinical Examinations" ("AI-SCI"), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars. High-fidelity simulations may also be used to evaluate interactions between users and LLMs within a clinical workflow, or to model the dynamic interactions of multiple LLMs. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in healthcare.