$\text{C}^2\text{P}$: Featuring Large Language Models with Causal Reasoning

Causal reasoning is one of the primary bottlenecks that Large Language Models (LLMs) must overcome to attain human-level intelligence. Recent studies indicate that LLMs display near-random performance on reasoning tasks. To address this, we introduce the Causal Chain of Prompting ($\text{C}^2\text{P}$), a reasoning framework that aims to equip current LLMs with causal reasoning capabilities. It is the first framework of its kind to operate autonomously, without relying on external tools or modules during either the causal learning or the reasoning phase. To evaluate $\text{C}^2\text{P}$, we first show that, on a synthetic benchmark dataset, reasoning accuracy improves by over $30.7\%$ and $25.9\%$ for GPT-4 Turbo and LLaMA 3.1, respectively, compared to the same models without $\text{C}^2\text{P}$. Then, with few-shot learning on as few as ten examples, the same LLMs with $\text{C}^2\text{P}$ gain more than $20.05\%$ and $20.89\%$ in reasoning accuracy, respectively, over the corresponding LLMs without $\text{C}^2\text{P}$ on the same dataset. To evaluate $\text{C}^2\text{P}$ in realistic scenarios, we use another benchmark dataset containing natural stories across various fields, including healthcare, medicine, economics, education, social sciences, environmental science, and marketing. The results show improved reasoning when $\text{C}^2\text{P}$ is applied; without our framework, the models often produce random and hallucinated responses. The improved performance of few-shot learned GPT-4 Turbo and LLaMA 3.1 with $\text{C}^2\text{P}$ demonstrates the generalizability of our framework.

