Science originates from discovering new causal knowledge through a combination of known facts and observations. Traditional causal discovery approaches rely mainly on high-quality measured variables, usually provided by human experts, to find causal relations. However, such causal variables are unavailable in a wide range of real-world applications. The rise of large language models (LLMs), which are trained to learn rich knowledge from massive observations of the world, provides a new opportunity to assist with discovering high-level hidden variables from raw observational data. We therefore introduce COAT: Causal representatiOn AssistanT. COAT incorporates LLMs as a factor proposer that extracts potential causal factors from unstructured data. LLMs can also be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to parse the raw unstructured data into structured data. The annotated data are then fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data and useful feedback to further improve the extraction of causal factors by LLMs. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies: review rating analysis and neuropathic diagnosis.
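The propose, annotate, learn, and feed-back loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from COAT itself: the toy reviews, the names `mock_llm_propose`, `annotate`, and `coat_like_loop`, and the keyword vocabulary are all hypothetical. The LLM is replaced by a keyword stub, and the causal learning module is replaced by a plain correlation filter, which is not FCI and does not establish causation; it only marks where such a module would sit in the loop.

```python
from math import sqrt
from typing import List, Tuple

# Toy data invented for illustration: (review text, rating).
REVIEWS: List[Tuple[str, int]] = [
    ("sweet and crisp apple", 5),
    ("sweet but soft", 4),
    ("crisp yet sour", 3),
    ("sour and soft, arrived on Monday", 1),
    ("sweet, crisp, arrived on Monday", 5),
    ("sour and soft", 1),
]

# Hypothetical pool of factor names the mock proposer can draw from.
VOCAB = ["sweet", "monday", "crisp"]


def mock_llm_propose(texts: List[str], already: List[str], k: int = 2) -> List[str]:
    """Stand-in for the LLM factor proposer: return up to k vocabulary
    words that occur in the given texts and were not proposed before."""
    joined = " ".join(texts).lower()
    return [w for w in VOCAB if w in joined and w not in already][:k]


def annotate(factor: str, texts: List[str]) -> List[int]:
    """Stand-in for LLM annotation: 1 if the keyword occurs, else 0."""
    return [int(factor in t.lower()) for t in texts]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def coat_like_loop(reviews, threshold: float = 0.3, max_rounds: int = 5) -> List[str]:
    texts = [t for t, _ in reviews]
    ratings = [r for _, r in reviews]
    proposed: List[str] = []
    accepted: List[str] = []
    focus = texts  # in the first round the proposer sees every sample
    for _ in range(max_rounds):
        candidates = mock_llm_propose(focus, proposed)
        if not candidates:
            break  # nothing new to propose
        proposed.extend(candidates)
        for factor in candidates:
            # Placeholder for the causal learning module (NOT FCI):
            # keep factors whose annotation correlates with the rating.
            if abs(pearson(annotate(factor, texts), ratings)) >= threshold:
                accepted.append(factor)
        # Feedback: samples on which no accepted factor is active are
        # handed back to the proposer for the next round.
        focus = [t for t in texts if not any(f in t.lower() for f in accepted)]
    return accepted


print(coat_like_loop(REVIEWS))
```

On this toy data the first round proposes `sweet` and `monday` and keeps only `sweet`; the feedback samples then lead the stub to propose `crisp`, which is also kept. A faithful implementation would replace the correlation filter with a constraint-based causal discovery procedure such as FCI and the keyword stubs with actual LLM calls.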