The ability of Large Language Models (LLMs) like ChatGPT to comprehend user intent and provide reasonable responses has made them extremely popular lately. In this paper, we assess the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. Specifically, we present a systematic analysis that measures ChatGPT's performance, explainability, calibration, and faithfulness, resulting in 15 keys from either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, ChatGPT is overconfident in its predictions, which results in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of the 7 fine-grained IE tasks, which comprise 14 datasets, to further promote research. The datasets and code are available at https://github.com/pkuserc/ChatGPT_for_IE.
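To make the notion of overconfidence and low calibration concrete, the following is a minimal sketch of Expected Calibration Error (ECE), one common calibration measure. The abstract does not state which calibration metric is used, so the choice of ECE, the function name `expected_calibration_error`, and the equal-width binning with `n_bins` are all assumptions made for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE: weighted mean gap between confidence and accuracy.

    confidences: model-reported confidence per prediction, in [0, 1].
    correct:     whether each prediction matched the gold label.
    n_bins:      number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width bins by confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))

    # Sum each bin's |average confidence - accuracy|, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

Under this sketch, a model that reports 0.9 confidence on four predictions but gets only one right has a gap of roughly 0.65 between confidence and accuracy, which is the kind of overconfidence the abstract describes. A perfectly calibrated model would score 0.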