Causal reasoning is one of the primary bottlenecks that Large Language Models (LLMs) must overcome to attain human-level intelligence. Recent studies indicate that LLMs display near-random performance on reasoning tasks. To address this, we introduce the Causal Chain of Prompting ($\text{C}^2\text{P}$), a reasoning framework that aims to equip current LLMs with causal reasoning capabilities. It is the first framework of its kind to operate autonomously, without relying on external tools or modules, during both the causal learning and reasoning phases. To evaluate $\text{C}^2\text{P}$, we first show that, on a synthetic benchmark dataset, the framework improves the reasoning accuracy of GPT-4 Turbo and LLaMA 3.1 by over $30.7\%$ and $25.9\%$, respectively, compared to the same models without $\text{C}^2\text{P}$. Then, using few-shot learning of the same LLMs with $\text{C}^2\text{P}$, reasoning accuracy increases by more than $20.05\%$ and $20.89\%$, respectively, with as few as ten examples, compared to the corresponding LLMs without $\text{C}^2\text{P}$ on the same dataset. To evaluate $\text{C}^2\text{P}$ in realistic scenarios, we use another benchmark dataset containing natural stories across various fields, including healthcare, medicine, economics, education, social sciences, environmental science, and marketing. The results show improved reasoning when $\text{C}^2\text{P}$ is applied, whereas the same models without our framework often produce random and hallucinated responses. The improved performance of GPT-4 Turbo and LLaMA 3.1 under few-shot learning with $\text{C}^2\text{P}$ demonstrates the generalizability of our framework.