Commonsense reasoning is an indispensable ingredient of intelligence and is crucial for large language models (LLMs) deployed in real-world scenarios. In this paper, we propose CORECODE, a dataset of commonsense knowledge manually annotated on dyadic dialogues, to evaluate the commonsense reasoning and commonsense conflict detection capabilities of Chinese LLMs. We categorize commonsense knowledge in everyday conversations into three dimensions: entity, event, and social interaction. For easy and consistent annotation, we standardize the form of commonsense knowledge annotation in open-domain dialogues as "domain: slot = value". A total of 9 domains and 37 slots are defined to capture diverse commonsense knowledge. With these pre-defined domains and slots, we collect 76,787 commonsense knowledge annotations from 19,700 dialogues through crowdsourcing. To evaluate and enhance the commonsense reasoning capability of LLMs on the curated dataset, we establish a series of dialogue-level reasoning and detection tasks, including commonsense knowledge filling, commonsense knowledge generation, commonsense conflict phrase detection, domain identification, slot identification, and event causal inference. We evaluate a wide variety of existing open-source Chinese LLMs on these tasks. Experimental results show that these models struggle to predict CORECODE's rich reasoning content; even ChatGPT achieves only 0.275 and 0.084 accuracy on the domain identification and slot identification tasks, respectively, under the zero-shot setting. We release the data and code of CORECODE at https://github.com/danshi777/CORECODE to promote the evaluation and study of LLM commonsense reasoning in the context of daily conversations.
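To make the "domain: slot = value" annotation form concrete, the following is a minimal sketch of how such a string could be parsed and how an identification task could be scored. It rests on assumptions: the example domain and slot names ("event", "cause"), the helper names `parse_annotation` and `accuracy`, and the use of exact-match scoring are all hypothetical illustrations and are not taken from the CORECODE release.

```python
import re
from typing import NamedTuple, Sequence


class Annotation(NamedTuple):
    """One commonsense annotation in the 'domain: slot = value' form."""
    domain: str
    slot: str
    value: str


# The domain ends at the first ':', the slot ends at the first '=',
# and everything after that '=' is treated as the value.
_PATTERN = re.compile(r"^\s*([^:=]+?)\s*:\s*([^=]+?)\s*=\s*(.+?)\s*$")


def parse_annotation(text: str) -> Annotation:
    """Split a 'domain: slot = value' string into its three parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in 'domain: slot = value' form: {text!r}")
    return Annotation(*match.groups())


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy (assumed metric) over paired labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Hypothetical annotation; the real domain and slot inventory
    # (9 domains, 37 slots) is defined by the dataset itself.
    print(parse_annotation("event: cause = missed the bus"))
    print(accuracy(["event", "entity"], ["event", "social"]))
```

Under this reading, domain identification and slot identification amount to predicting the first and second fields of an annotation, which is why a simple label-matching score is used in the sketch.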