Exploring zero-shot capability of large language models in inferences from medical oncology notes

Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, which are often documented in detail in clinical notes. Despite the vital role of these notes, no current oncology information representation and annotation schema fully captures the diversity of information recorded in them. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, comprehensively annotated oncology datasets are lacking, so the ability of LLMs to extract information from, and reason over, the complex rhetoric of oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 40 de-identified breast and pancreatic cancer progress notes from the University of California, San Francisco, we applied this schema to assess the ability of three recently released LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to perform zero-shot extraction of detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. GPT-4 exhibited the best overall performance, with an average BLEU score of 0.68, an average ROUGE score of 0.71, and an average accuracy of 67% on complex tasks (based on expert manual evaluation of a subset). Notably, it was proficient in extracting tumor characteristics and medications, and demonstrated superior performance on the advanced tasks of inferring symptoms caused by cancer and identifying medications considered for future use. GPT-4 may already be usable to extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.
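
For readers unfamiliar with overlap metrics such as ROUGE, the following is a minimal sketch of how a model's extracted answer could be scored against a reference annotation. It assumes a lowercased, whitespace-tokenized unigram variant (ROUGE-1 F1); the abstract does not state which ROUGE variant or tokenization was used, and the function name `rouge1_f1` and the example strings are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string.

    Assumption: lowercased whitespace tokenization; the study's actual
    preprocessing and ROUGE variant are not specified in the abstract.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model output and reference annotation.
score = rouge1_f1("stage II breast cancer", "stage II invasive breast cancer")
print(round(score, 3))
```

Here the candidate matches four of the five reference tokens, so precision is 1.0 and recall is 0.8. A score of this kind rewards lexical overlap only, which is one reason a manual expert evaluation of complex tasks is reported alongside the automatic metrics.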
