CHiLL: Zero-shot Custom Interpretable Feature Extraction from Clinical Notes with Large Language Models

Large Language Models (LLMs) have yielded fast and dramatic progress in NLP, and now offer strong few- and zero-shot capabilities on new tasks, reducing the need for annotation. This is especially exciting for the medical domain, in which supervision is often scant and expensive. At the same time, model predictions are rarely so accurate that they can be trusted blindly. Clinicians therefore tend to favor "interpretable" classifiers over opaque LLMs. For example, risk prediction tools are often linear models defined over manually crafted predictors that must be laboriously extracted from electronic health records (EHRs). We propose CHiLL (Crafting High-Level Latents), which uses LLMs to permit natural language specification of high-level features for linear models via zero-shot feature extraction using expert-composed queries. This approach promises to let physicians use their domain expertise to craft features that are clinically meaningful for a downstream task of interest, without having to extract them manually from raw EHR data (as is often done now). We are motivated by a real-world risk prediction task, but as a reproducible proxy, we use MIMIC-III and MIMIC-CXR data and standard predictive tasks (e.g., 30-day readmission) to evaluate our approach. We find that linear models using automatically extracted features perform comparably to models using reference features, and provide greater interpretability than linear models using "Bag-of-Words" features. We verify that learned feature weights align well with clinical expectations.
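
The pipeline described above (expert-written queries, zero-shot answers used as features, then a linear model) can be illustrated with a minimal sketch. This is not the authors' implementation: the queries, the notes, the labels, and every function name below are hypothetical, and `keyword_stub_llm` is a keyword-matching stand-in for an actual LLM call so that the example runs without any model or external library.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical expert-composed queries; each one defines a binary feature.
QUERIES: Dict[str, str] = {
    "heart_failure": "Does the patient have heart failure?",
    "lives_alone": "Does the patient live alone?",
}


def keyword_stub_llm(note: str, query: str) -> float:
    """Stand-in for a zero-shot LLM call (assumption, not a real API).

    A real system would prompt an LLM with the note and the query and map
    its yes/no answer to 1.0/0.0. Here we only match a keyword.
    """
    keywords = {
        "Does the patient have heart failure?": "heart failure",
        "Does the patient live alone?": "lives alone",
    }
    return 1.0 if keywords[query] in note.lower() else 0.0


def extract_features(
    note: str, llm: Callable[[str, str], float] = keyword_stub_llm
) -> List[float]:
    """Answer every query for one note, giving one feature per query."""
    return [llm(note, query) for query in QUERIES.values()]


def fit_logistic(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 500
) -> Tuple[List[float], float]:
    """Fit a logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


# Toy, made-up notes and labels (1 = readmitted); not MIMIC data.
NOTES = [
    "History of heart failure. Lives alone.",
    "Heart failure exacerbation, lives with family.",
    "Elective knee surgery. Lives alone.",
    "Routine follow-up, lives with spouse.",
]
LABELS = [1, 1, 0, 0]

X = [extract_features(note) for note in NOTES]
coefs, bias = fit_logistic(X, LABELS)

# Each learned weight is attached to a human-readable query.
weights = dict(zip(QUERIES.keys(), coefs))
for name, value in weights.items():
    print(f"{name}: {value:+.2f}")
```

Because each coefficient corresponds to a named clinical question, a reader can inspect the sign and size of each weight directly, which is the kind of interpretability that Bag-of-Words features do not offer as readily. In this toy data the label depends only on the first query, so its weight should dominate.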
