Predicting postoperative risk can inform effective care management and planning. We explored large language models (LLMs) for predicting postoperative risk from clinical texts using various tuning strategies. We used records from 84,875 patients at Barnes Jewish Hospital (BJH) between 2018 and 2021, with a mean duration of follow-up, based on the length of postoperative ICU stay, of less than 7 days. Methods were replicated on the MIMIC-III dataset. Outcomes included 30-day mortality, pulmonary embolism (PE) and pneumonia. Three domain adaptation and finetuning strategies were implemented for three LLMs (BioGPT, ClinicalBERT and BioClinicalBERT): self-supervised objectives; incorporating labels with semi-supervised finetuning; and foundational modelling through multi-task learning. Model performance was compared using AUROC and AUPRC for classification tasks and MSE and R² for regression tasks. The cohort had a mean age of 56.9 (sd: 16.8) years; 50.3% were male and 74% were White. Pre-trained LLMs outperformed traditional word embeddings, with absolute maximal gains of 38.3% for AUROC and 14% for AUPRC. Adapting models through self-supervised finetuning further improved performance by 3.2% for AUROC and 1.5% for AUPRC. Incorporating labels into the finetuning procedure boosted performance further: compared to self-supervised finetuning, semi-supervised finetuning improved performance by 1.8% for AUROC and 2% for AUPRC, and foundational modelling improved it by 3.6% for AUROC and 2.6% for AUPRC. Pre-trained clinical LLMs offer opportunities for postoperative risk prediction with unseen data, and the further improvements from finetuning suggest benefits in adapting pre-trained models to note-specific perioperative use cases. Incorporating labels can further boost performance. The superior performance of foundational models suggests the potential of task-agnostic learning towards generalizable LLMs in perioperative care.
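The classification metrics named in the abstract, AUROC and AUPRC, can be illustrated with a minimal sketch. This is not the study's evaluation code: the function names, the toy labels and scores, and the reading of "absolute gain" as a difference in percentage points are all assumptions made for illustration. AUPRC is estimated here by average precision, one common estimator.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of AUPRC.
    Assumes no tied scores (ties are broken by sort order here)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def absolute_gain_pct(baseline: float, improved: float) -> float:
    """Absolute gain in percentage points (assumed meaning of 'absolute gain')."""
    return 100.0 * (improved - baseline)


# Hypothetical toy data: 1 = outcome occurred (e.g. pneumonia), 0 = did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print("AUROC:", auroc(toy_labels, toy_scores))
print("AUPRC (average precision):", average_precision(toy_labels, toy_scores))
```

For rare outcomes such as pulmonary embolism, AUPRC is sensitive to the positive-class prevalence while AUROC is not, which is one reason for reporting both.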