The performance of deep learning-based natural language processing systems depends on large amounts of labeled training data, which in the clinical domain are neither easily available nor affordable. Weak supervision and in-context learning offer partial solutions, particularly with large language models (LLMs), but their performance still trails that of traditional supervised methods trained on moderate amounts of gold-standard data. Moreover, inference with LLMs is computationally expensive. We propose an approach that combines LLM fine-tuning with weak supervision, requires virtually no domain knowledge, and still achieves consistently superior performance. Using a prompt-based approach, the LLM generates weakly labeled data for training a downstream BERT model, which is then further fine-tuned on small amounts of gold-standard data. We evaluate this approach with Llama2 on three n2c2 datasets. With no more than 10 gold-standard notes, our final BERT models, weakly supervised by fine-tuned Llama2-13B, consistently outperformed out-of-the-box PubMedBERT by 4.7% to 47.9% in F1 score. With only 50 gold-standard notes, our models approached the performance of fully fine-tuned systems.
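The weak-supervision step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the binary label set, and the `query_llm` stub (standing in for a fine-tuned Llama2 generation call) are all assumptions.

```python
# Sketch of prompt-based weak labeling for a downstream BERT classifier.
# All names below (LABELS, build_prompt, query_llm) are hypothetical.

LABELS = ["PRESENT", "ABSENT"]  # assumed binary label set for illustration


def build_prompt(note: str, concept: str) -> str:
    """Wrap a clinical note in a zero-shot classification prompt."""
    return (
        f"Does the following note indicate that the patient has {concept}? "
        f"Answer PRESENT or ABSENT.\nNote: {note}\nAnswer:"
    )


def parse_label(completion: str) -> str:
    """Map a free-text LLM completion onto the closed label set."""
    text = completion.strip().upper()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "ABSENT"  # conservative fallback for unparseable output


def weak_label(notes, concept, query_llm):
    """Produce (note, weak_label) pairs for training the downstream model."""
    return [(n, parse_label(query_llm(build_prompt(n, concept)))) for n in notes]
```

The resulting weakly labeled pairs would then be fed to a standard sequence-classification fine-tuning loop for the BERT model (e.g., PubMedBERT), followed by further fine-tuning on the small gold-standard set.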