Since its introduction to the public, ChatGPT has had an unprecedented impact. While some experts have praised AI advancements and highlighted their potential risks, others have been critical of the accuracy and usefulness of Large Language Models (LLMs). In this paper, we are interested in the ability of LLMs to identify causal relationships. We focus on the well-established GPT-4 (Turbo) and evaluate its performance under the most restrictive conditions, isolating its ability to infer causal relationships based solely on variable labels, without any additional context provided by humans; this establishes the minimum level of effectiveness one can expect when the model is given label-only information. We show that questionnaire participants judge the GPT-4 graphs as the most accurate in the evaluated categories, closely followed by knowledge graphs constructed by domain experts, with causal Machine Learning (ML) far behind. We use these results to highlight an important limitation of causal ML, which often produces causal graphs that violate common sense, undermining trust in them. However, we show that pairing GPT-4 with causal ML overcomes this limitation, yielding graphical structures learnt from real data that align more closely with those identified by domain experts than structures learnt by causal ML alone. Overall, our findings suggest that although GPT-4 was not explicitly designed to reason causally, it can still be a valuable tool for causal representation, as it improves the causal discovery process of the causal ML algorithms that are designed to do just that.