In this study, we investigated the potential of ChatGPT, a large language model developed by OpenAI, for the clinical named entity recognition (NER) task defined in the 2010 i2b2 challenge, in a zero-shot setting with two different prompt strategies. We compared its performance with that of GPT-3 in a similar zero-shot setting, as well as a fine-tuned BioClinicalBERT model using a set of synthetic clinical notes from MTSamples. ChatGPT outperformed GPT-3 in the zero-shot setting, with F1 scores of 0.418 (vs. 0.250) and 0.620 (vs. 0.480) for exact and relaxed matching, respectively. The choice of prompt also affected ChatGPT's performance substantially, with relaxed-matching F1 scores of 0.628 vs. 0.541 for the two prompt strategies. Although ChatGPT's performance was still lower than that of the supervised BioClinicalBERT model (relaxed-matching F1 scores of 0.620 vs. 0.888), our study demonstrates the great potential of ChatGPT for clinical NER in a zero-shot setting, which is appealing because it requires no annotated data.
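The difference between the exact- and relaxed-matching F1 scores reported above can be illustrated with a minimal sketch. It assumes that an exact match requires identical span boundaries and entity type, and that a relaxed match requires the same entity type with any character overlap; the abstract does not spell out these definitions, so treat them as assumptions. The names `Entity` and `f1_score` and the example spans are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# (start, end, type): character offsets with end exclusive, plus an entity type.
# This representation is an assumption made for illustration only.
Entity = Tuple[int, int, str]


def _overlaps(a: Entity, b: Entity) -> bool:
    """True when two spans have the same type and share at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f1_score(gold: List[Entity], pred: List[Entity], relaxed: bool = False) -> float:
    """F1 over entity spans under exact or relaxed (overlap) matching."""
    if not gold or not pred:
        return 0.0
    if relaxed:
        # A span counts as matched if it overlaps any same-type span on the other side.
        pred_hits = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        gold_hits = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        # A span counts as matched only if boundaries and type are identical.
        gold_set, pred_set = set(gold), set(pred)
        pred_hits = sum(p in gold_set for p in pred)
        gold_hits = sum(g in pred_set for g in gold)
    precision = pred_hits / len(pred)
    recall = gold_hits / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: the second prediction has a shifted left boundary,
# and the third prediction misses its gold span entirely.
gold = [(0, 12, "problem"), (20, 29, "treatment"), (35, 40, "test")]
pred = [(0, 12, "problem"), (22, 29, "treatment"), (50, 55, "test")]

exact_f1 = f1_score(gold, pred)
relaxed_f1 = f1_score(gold, pred, relaxed=True)
```

In this toy case the prediction with the shifted boundary is rejected under exact matching but accepted under relaxed matching, which is why relaxed scores are never lower than exact scores for the same predictions.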