We investigate whether general-domain large language models such as GPT-4 Turbo can perform risk stratification and predict post-operative outcome measures using a description of the procedure and a patient's clinical notes derived from the electronic health record. We examine predictive performance on eight tasks: prediction of ASA Physical Status Classification, hospital admission, ICU admission, unplanned admission, hospital mortality, PACU Phase 1 duration, hospital duration, and ICU duration. Few-shot and chain-of-thought prompting improve predictive performance for several of the tasks. We achieve F1 scores of 0.50 for ASA Physical Status Classification, 0.81 for ICU admission, and 0.86 for hospital mortality. Performance on the duration prediction tasks was universally poor across all prompt strategies. Current-generation large language models can assist clinicians in perioperative risk stratification on classification tasks and produce high-quality natural language summaries and explanations.
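For readers unfamiliar with the F1 metric reported above, the following is a minimal sketch of how per-class and macro-averaged F1 can be computed from true and predicted labels. The abstract does not state which averaging scheme was used, so the macro average here is an assumption, and the function names `f1_per_class` and `macro_f1` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy binary example (made-up labels, not data from the study):
# 1 = ICU admission, 0 = no ICU admission
example_true = [1, 1, 0, 0]
example_pred = [1, 0, 0, 0]
example_scores = f1_per_class(example_true, example_pred)
example_macro = macro_f1(example_true, example_pred)
```

For a multi-class task such as ASA Physical Status Classification, the same functions apply unchanged, since each class label is scored separately before averaging.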