Large Language Models (LLMs) like ChatGPT have demonstrated remarkable capabilities in comprehending user intents and generating reasonable and useful responses. Beyond their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT on 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, it is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) Uncertainty in generation causes uncertainty in information extraction results, which may hinder its application to MedIE tasks.