Causal graph recovery is essential to causal inference. Traditional methods are typically knowledge-based or rely on statistical estimation, and are therefore limited by data collection biases and by individuals' knowledge of the factors that affect the relations between variables of interest. Advances in large language models (LLMs) provide opportunities to address these problems. We propose a novel method that uses the extensive knowledge contained in a large corpus of scientific literature to deduce causal relationships in general causal graph recovery tasks. The method leverages Retrieval-Augmented Generation (RAG) based LLMs to systematically analyze and extract pertinent information from a comprehensive collection of research papers. It first retrieves relevant text chunks from the aggregated literature. The LLM is then tasked with identifying and labelling potential associations between factors. Finally, we give a method to aggregate these associational relationships into a causal graph. We demonstrate that our method can construct high-quality causal graphs on the well-known SACHS dataset solely from literature.
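The three stages described above (retrieve, label, aggregate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the variable names `X1` to `X3`, the toy corpus, the keyword-based `retrieve`, the rule-based `label_chunk` that stands in for the LLM call, and the majority vote in `build_skeleton`. It is not the aggregation method proposed in this work, and it recovers only an undirected skeleton of associations, leaving edge orientation aside.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical variables and toy "literature" chunks (not real data).
VARIABLES = ["X1", "X2", "X3"]
CHUNKS = [
    "X1 is associated with X2 in the first study.",
    "A second study also reports that X1 is associated with X2.",
    "X2 is not associated with X3 in this cohort.",
    "X1 is associated with X3 under treatment.",
    "X1 is not associated with X3 in controls.",
    "Other work finds X1 is associated with X3.",
    "X2 and X3 were both measured.",
]


def retrieve(a, b, chunks):
    """Toy retrieval: keep chunks that mention both variables as whole tokens."""
    hits = []
    for chunk in chunks:
        tokens = set(re.findall(r"\w+", chunk))
        if a in tokens and b in tokens:
            hits.append(chunk)
    return hits


def label_chunk(chunk, a, b):
    """Stand-in for the LLM labelling step (assumption: simple keyword rules).

    A real system would prompt an LLM with the chunk and the pair (a, b);
    here a and b are unused because every toy chunk discusses a single pair.
    """
    text = chunk.lower()
    if "not associated" in text:
        return "not_associated"
    if "associated" in text:
        return "associated"
    return "unknown"


def build_skeleton(variables, chunks):
    """Aggregate per-chunk labels by majority vote into undirected edges."""
    edges = set()
    for a, b in combinations(variables, 2):
        votes = Counter(label_chunk(c, a, b) for c in retrieve(a, b, chunks))
        if votes["associated"] > votes["not_associated"]:
            edges.add((a, b))
    return edges


skeleton = build_skeleton(VARIABLES, CHUNKS)
print(sorted(skeleton))  # [('X1', 'X2'), ('X1', 'X3')]
```

In this toy run the pair (`X1`, `X3`) is kept because two chunks support the association and one contradicts it, while (`X2`, `X3`) is dropped because its only informative chunk is negative. Majority voting is just one simple way to reconcile conflicting evidence across papers.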