More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed, and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed to ensure that benchmark examples are of consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video-based, and 7 based on simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.