Large language models have attracted considerable interest for their strong performance on a wide range of tasks. Among them, ChatGPT, developed by OpenAI, has become extremely popular with early adopters, some of whom regard it as a disruptive technology in fields such as customer service, education, healthcare, and finance. Understanding the opinions of these early users is essential, as they can provide valuable insight into the technology's potential strengths, weaknesses, and likelihood of success or failure in different areas. This research examines the responses generated by ChatGPT on several conversational QA corpora. The study employed BERT similarity scores to compare these responses with the correct answers and to obtain Natural Language Inference (NLI) labels. Evaluation scores were also computed and compared to determine the overall performance of GPT-3 and GPT-4. Additionally, the study identified instances where ChatGPT answered questions incorrectly, providing insight into areas where the model may be prone to error.
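The similarity comparison mentioned above can be illustrated with a minimal sketch. It assumes that each response and each reference answer has already been encoded into a fixed-length vector by a BERT-style encoder, and it scores the pair with cosine similarity. The function names, the toy vectors, and the 0.8 threshold are hypothetical and are not taken from the study, which does not specify these details here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def is_match(response_vec, reference_vec, threshold=0.8):
    """Treat a response as matching the reference if similarity >= threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return cosine_similarity(response_vec, reference_vec) >= threshold


# Toy stand-ins for sentence embeddings (made-up values, not real BERT output).
response_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]
unrelated_vec = [0.9, -0.3, 0.4]

print(is_match(response_vec, reference_vec))
print(is_match(response_vec, unrelated_vec))
```

In practice the vectors would come from an actual encoder, and the threshold would need to be tuned on held-out data rather than fixed in advance.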