Low-resource classification of mobility functioning information in clinical sentences using large language models

Objective: Function is increasingly recognized as an important indicator of whole-person health. This study evaluates how accurately publicly available large language models (LLMs) can identify the presence of functioning information in clinical notes, and explores several strategies for improving performance on this task. Materials and Methods: We construct a balanced binary classification dataset of 1,000 sentences drawn from the Mobility NER dataset, which was curated from n2c2 clinical notes. For evaluation, we build zero-shot and few-shot prompts that ask the LLMs whether a given sentence contains mobility functioning information. Two sampling techniques, random sampling and k-nearest neighbor (kNN)-based sampling, are used to select the few-shot examples. We also apply a parameter-efficient prompt-based fine-tuning method to the LLMs and evaluate their performance under various training settings. Results: Flan-T5-xxl outperforms all other models in both zero-shot and few-shot settings, achieving an F1 score of 0.865 with a single demonstration example selected by kNN sampling. In the prompt-based fine-tuning experiments, this model also performs best across all low-resource settings, reaching an F1 score of 0.922 when trained on the full training dataset. The smaller Flan-T5-xl needs only 2.3M additional fine-tuned parameters to perform comparably to the fully fine-tuned Gatortron-base model, with both exceeding an F1 score of 0.9. Conclusion: Open-source instruction-tuned LLMs demonstrate strong in-context learning capability on the mobility functioning classification task. Their performance can be improved further by additional fine-tuning on a task-specific dataset.
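
The kNN-based selection of few-shot examples described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which sentence representation, distance measure, or prompt wording was used, so the bag-of-words `embed` function, the cosine similarity, the prompt template, and the example sentences below are all assumptions chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding (assumption): lowercase bag-of-words counts.

    A real system would likely use a dense sentence encoder instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_select(query, pool, k):
    """Return the k labeled (sentence, label) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Assemble a few-shot prompt; the question wording is hypothetical."""
    question = "Does this sentence contain mobility functioning information? Answer yes or no."
    parts = [f"Sentence: {s}\n{question}\nAnswer: {label}" for s, label in demos]
    parts.append(f"Sentence: {query}\n{question}\nAnswer:")
    return "\n\n".join(parts)


# Invented example sentences, not taken from the Mobility NER dataset.
POOL = [
    ("Patient ambulates with a rolling walker.", "yes"),
    ("Blood pressure was stable overnight.", "no"),
    ("She is able to climb stairs independently.", "yes"),
    ("No known drug allergies.", "no"),
]

query = "Patient ambulates independently without a walker."
demos = knn_select(query, POOL, k=1)  # one-shot setting
prompt = build_prompt(query, demos)
print(prompt)
```

With `k=1` this mirrors the one-shot setting reported in the Results; swapping `knn_select` for a call to `random.sample(POOL, k)` would give the random-sampling baseline.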
