Revealing the underlying causal mechanisms of the real world is key to the development of science. Despite progress over the past decades, traditional causal discovery (CD) approaches mainly rely on high-quality measured variables, usually provided by human experts, to find causal relations. The lack of well-defined high-level variables in many real-world applications has long been a roadblock to the broader adoption of CD methods. To this end, this paper presents the Causal representatiOn AssistanT (COAT), which introduces large language models (LLMs) to bridge the gap. LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data, so it is natural to employ them to propose useful high-level factors and to craft measurements of those factors. Meanwhile, COAT also adopts CD methods to find causal relations among the identified variables and to provide feedback to the LLM, iteratively refining the proposed factors. We show that LLMs and CD methods are mutually beneficial, and that the constructed feedback provably helps the factor proposal. To evaluate COAT comprehensively, we construct and curate several synthetic and real-world benchmarks, including the analysis of human reviews and the diagnosis of neuropathy and brain tumors. Extensive empirical results confirm the effectiveness and reliability of COAT, with significant improvements.
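The iterative interplay described above (LLM proposes factors, a CD method verifies them, feedback drives the next proposal round) can be sketched as follows. This is a minimal toy illustration, not the paper's actual method: `ToyLLM`, the alignment-based stand-in for causal discovery, and the feedback rule are all hypothetical placeholders.

```python
class ToyLLM:
    """Hypothetical stand-in for an LLM that proposes high-level
    factors and crafts their measurements on raw samples."""

    def __init__(self, factor_pool):
        self.factor_pool = list(factor_pool)  # candidate factor names

    def propose_factors(self, known, feedback):
        # Propose one not-yet-considered factor per round.
        for f in self.factor_pool:
            if f not in known:
                return [f]
        return []

    def annotate(self, samples, factors):
        # "Measure" each proposed factor on each raw sample
        # (here: a simple key lookup with a default of 0).
        return [{f: s.get(f, 0) for f in factors} for s in samples]


def run_causal_discovery(table, outcomes):
    # Placeholder for a real CD method: keep factors whose measured
    # values perfectly align with the outcome (a toy criterion only).
    return [f for f in table[0] if [row[f] for row in table] == outcomes]


def build_feedback(kept):
    # Feedback for the LLM: if nothing explains the outcome yet,
    # request more factors; otherwise stop (return None).
    return None if kept else "outcome still unexplained"


def coat_loop(samples, outcomes, llm, max_rounds=5):
    factors, feedback, kept = [], "start", []
    while feedback and max_rounds:
        max_rounds -= 1
        factors += llm.propose_factors(factors, feedback)  # 1) propose
        table = llm.annotate(samples, factors)             # 2) measure
        kept = run_causal_discovery(table, outcomes)       # 3) discover
        feedback = build_feedback(kept)                    # 4) feed back
    return kept


# Toy usage: "size" is proposed first and rejected; the feedback
# triggers a second round in which "color" explains the outcome.
llm = ToyLLM(["size", "color"])
samples = [{"color": 0, "size": 1}, {"color": 1, "size": 1},
           {"color": 0, "size": 0}]
print(coat_loop(samples, [0, 1, 0], llm))  # → ['color']
```

The point of the sketch is the division of labor: the (toy) LLM supplies candidate variables and their measurements, while a separate discovery step validates them and generates the feedback that steers the next proposal round.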