Recent advances in large language models (LLMs) have produced highly capable models such as OpenAI's ChatGPT. These models perform exceptionally well on a variety of tasks, including question answering, essay composition, and code generation, but their effectiveness in the healthcare sector remains uncertain. In this study, we investigate the potential of ChatGPT to aid clinical text mining by examining its ability to extract structured information from unstructured healthcare texts, focusing on biological named entity recognition and relation extraction. Our preliminary results indicate that applying ChatGPT directly to these tasks yields poor performance and raises privacy concerns, since patients' information must be uploaded to the ChatGPT API. To overcome these limitations, we propose a new training paradigm: use ChatGPT to generate a large quantity of high-quality labeled synthetic data, then fine-tune a local model on that data for the downstream task. This method substantially improves downstream performance, raising the F1-score from 23.37% to 63.99% on the named entity recognition task and from 75.86% to 83.59% on the relation extraction task. Generating data with ChatGPT can also significantly reduce the time and effort required for data collection and labeling, and it mitigates data privacy concerns. In summary, the proposed framework is a promising way to make LLMs more applicable to clinical text mining.
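To make the reported F1-scores concrete, the following is a minimal sketch of a micro-averaged, exact-match entity-level F1 computation. The abstract does not state the evaluation protocol, so the exact-match criterion, the `(start, end, label)` entity representation, the function name `entity_f1`, and the example labels are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def entity_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Micro-averaged F1 over exact-match entities across all sentences.

    Assumption: a predicted entity counts as correct only if its span
    and its label both match a gold entity exactly.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        # Covers empty inputs and avoids division by zero.
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data with illustrative labels (not taken from the study).
gold = [{(0, 1, "Disease"), (4, 4, "Chemical")}, {(2, 3, "Disease")}]
pred = [{(0, 1, "Disease")}, {(2, 3, "Chemical")}]
score = entity_f1(gold, pred)  # precision 1/2, recall 1/3
```

In this toy case one of two predictions is correct and one of three gold entities is recovered, so precision is 1/2 and recall is 1/3. A score computed in this style would be compared between the direct-prompting setup and the fine-tuned local model.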