Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions

ChatGPT is a powerful large language model (LLM) that has made remarkable progress in natural language understanding. Nevertheless, the performance and limitations of the model still need to be extensively evaluated. Because ChatGPT covers resources such as Wikipedia and supports natural language question answering, it has garnered attention as a potential replacement for traditional knowledge-based question answering (KBQA) models. Complex question answering is a challenging KBQA task that comprehensively tests a model's ability in semantic parsing and reasoning. To assess the performance of ChatGPT as a question answering system (QAS) using its own knowledge, we present a framework that evaluates its ability to answer complex questions. Our approach involves categorizing the potential features of complex questions and describing each test question with multiple labels to identify combinatorial reasoning. Following the black-box testing specifications of CheckList proposed by Ribeiro et al., we develop an evaluation method to measure the functionality and reliability of ChatGPT in reasoning for answering complex questions. We use the proposed framework to evaluate the performance of ChatGPT in question answering on 8 real-world KB-based CQA datasets, including 6 English and 2 multilingual datasets, with a total of approximately 190,000 test cases. We compare the evaluation results of ChatGPT, GPT-3.5, GPT-3, and FLAN-T5 to identify common long-term problems in LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-ChatGPT.
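The multi-label idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TestCase` type, the label names (`multi-hop`, `count`), the substring-based `is_correct` check, and the three toy questions and replies are all assumptions made for illustration. The sketch only shows how tagging each question with several labels lets accuracy be reported per reasoning feature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TestCase:
    question: str
    gold_answers: frozenset  # acceptable answer strings, lowercased
    labels: frozenset        # hypothetical feature labels for this question


def is_correct(model_answer: str, case: TestCase) -> bool:
    """Loose match (an assumption): any gold answer appears in the free-text reply."""
    reply = model_answer.lower()
    return any(gold in reply for gold in case.gold_answers)


def accuracy_by_label(cases, answers):
    """Return {label: accuracy}; a case counts toward every label it carries."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case, answer in zip(cases, answers):
        correct = is_correct(answer, case)
        for label in case.labels:
            totals[label] += 1
            hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Toy cases and stubbed model replies, invented purely for illustration.
cases = [
    TestCase(
        "What is the capital of the country where the Eiffel Tower is located?",
        frozenset({"paris"}),
        frozenset({"multi-hop"}),
    ),
    TestCase(
        "How many moons does Mars have?",
        frozenset({"2", "two"}),
        frozenset({"count"}),
    ),
    TestCase(
        "How many moons does the planet closest to the Sun have?",
        frozenset({"0", "zero", "no moons"}),
        frozenset({"multi-hop", "count"}),
    ),
]
answers = [
    "The capital is Paris.",
    "Mars has three moons.",   # deliberately wrong stub reply
    "Mercury has no moons.",
]

result = accuracy_by_label(cases, answers)
print(result)
```

Because the third question carries both labels, it contributes to both per-label scores, which is what makes combinations of reasoning types visible in the breakdown. Substring matching is a simplification; a real evaluation of free-text answers would likely need alias handling and stricter matching.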


