Commonsense datasets have been well developed in Natural Language Processing, mainly through crowdsourced human annotation. However, the genuineness of commonsense reasoning benchmarks remains debated. Specifically, a significant portion of instances in some commonsense benchmarks do not concern commonsense knowledge at all. This problem undermines the measurement of the true commonsense reasoning ability of evaluated models. It has also been suggested that the problem originates from a blurry concept of commonsense knowledge, as distinguished from other types of knowledge. To examine these claims, in this study we survey existing definitions of commonsense knowledge, ground them in three frameworks for defining concepts, and consolidate them into a multi-framework unified definition of commonsense knowledge (the consolidated definition). We then use the consolidated definition for annotation and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets to test these claims. Our study shows that both datasets contain a large portion of non-commonsense-knowledge instances, and that there is a large performance gap between the two subsets, with Large Language Models (LLMs) performing worse on commonsense-knowledge instances.