Large language models (LLMs) such as ChatGPT and GPT-4 have made significant progress in NLP. However, their ability to memorize, represent, and leverage commonsense knowledge remains a well-known pain point. Four questions remain open: (1) Can GPTs effectively answer commonsense questions? (2) Are GPTs knowledgeable in commonsense? (3) Are GPTs aware of the underlying commonsense knowledge needed to answer a specific question? (4) Can GPTs effectively leverage commonsense to answer questions? To address these questions, we conduct a series of experiments on ChatGPT's commonsense abilities. The results show that: (1) GPTs achieve good QA accuracy on commonsense tasks, but still struggle with certain types of knowledge. (2) ChatGPT is knowledgeable and can accurately generate most of the commonsense knowledge when given knowledge prompts. (3) Despite its knowledge, ChatGPT is an inexperienced commonsense problem solver: it cannot precisely identify which commonsense knowledge is required to answer a specific question. These findings call for better mechanisms for utilizing commonsense knowledge in LLMs, such as instruction following and better commonsense guidance.