Large Language Models (LLMs) have shown state-of-the-art performance on a variety of tasks, including arithmetic and reasoning; however, to gauge the intellectual capabilities of LLMs, causal reasoning has become a reliable proxy for validating a human-like general understanding of the mechanics and intricacies of the world. Previous works in natural language processing (NLP) have either focused on open-ended causal reasoning via causal commonsense reasoning (CCR) or framed symbolic-representation-based question answering for theoretically grounded analysis via a causal inference engine. The former has the advantage of real-world grounding but lacks theoretically grounded analysis/validation, whereas the latter is far removed from real-world grounding. In this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed Daily activities) framework, which is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of an enormous number of causal queries (~9 million) and comes close to the mini-Turing test, simulating causal reasoning to evaluate understanding of a daily real-world task. We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events.