ThinkSum: Probabilistic reasoning over sets using large language models

Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the more advanced LLMs fail in scenarios that require reasoning over multiple objects or facts and making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner. In the first stage (Think - retrieval of associations), an LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (Sum - probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks, achieving improvements over the state of the art using GPT-family models on thirteen difficult tasks, often with far smaller model variants. We also compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. Our results suggest that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs. Overall, our proposed paradigm represents a promising approach for enhancing the reasoning capabilities of LLMs.
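The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the aggregation rule, so the "product" and "mixture" modes below are illustrative assumptions, and the names `think`, `aggregate`, `toy_log_prob`, and `TOY_PROBS` are hypothetical. A lookup table stands in for the LLM, and the queries run sequentially here, although they are independent and could be issued in parallel.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for an LLM call: returns log p(candidate | phrase).
LogProbFn = Callable[[str, str], float]


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def think(phrases: Sequence[str], candidates: Sequence[str],
          log_prob: LogProbFn) -> List[List[float]]:
    """Think stage: one independent query per (phrase, candidate) pair.

    Returns one row per phrase, renormalized over the candidate set.
    """
    table = []
    for phrase in phrases:
        row = [log_prob(phrase, cand) for cand in candidates]
        norm = log_sum_exp(row)
        table.append([value - norm for value in row])
    return table


def aggregate(table: List[List[float]], candidates: Sequence[str],
              mode: str = "product") -> Dict[str, float]:
    """Sum stage: combine per-phrase log-probabilities outside the LLM.

    "product" adds log-probabilities (every phrase must support the answer);
    "mixture" averages probabilities (any phrase may support the answer).
    Both rules are assumptions made for this sketch.
    """
    if not table:
        raise ValueError("need at least one phrase")
    scores = {}
    for j, cand in enumerate(candidates):
        column = [row[j] for row in table]
        if mode == "product":
            scores[cand] = sum(column)
        elif mode == "mixture":
            scores[cand] = log_sum_exp(column) - math.log(len(column))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return scores


# Toy probabilities standing in for real LLM outputs (made up for illustration).
TOY_PROBS = {
    ("has wings", "bird"): 0.9, ("has wings", "dog"): 0.1,
    ("lays eggs", "bird"): 0.8, ("lays eggs", "dog"): 0.2,
}


def toy_log_prob(phrase: str, candidate: str) -> float:
    return math.log(TOY_PROBS[(phrase, candidate)])


phrases = ["has wings", "lays eggs"]
candidates = ["bird", "dog"]
table = think(phrases, candidates, toy_log_prob)
product_scores = aggregate(table, candidates, mode="product")
mixture_scores = aggregate(table, candidates, mode="mixture")
prediction = max(product_scores, key=product_scores.get)
print(prediction, product_scores, mixture_scores)
```

Because the aggregation happens in ordinary code rather than inside a prompt, the per-phrase table can be inspected directly, which is one way to read the abstract's claim about interpretability.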

