Benchmarks for Automated Commonsense Reasoning: A Survey

More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed to ensure that benchmark examples are of consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video-based, and 7 based on simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.

