This paper considers the challenges that Large Language Models (LLMs) face when reasoning over text in which uncertainty is explicitly quantified by probability values. This type of reasoning arises in contexts ranging from everyday conversations to medical decision-making. Despite improvements in their mathematical reasoning capabilities, LLMs still exhibit significant difficulties with probabilistic reasoning. To address this problem, we first introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We then use this dataset to illustrate the specific limitations of LLMs on tasks involving probabilistic reasoning, and we present several strategies that map the problem to different formal representations, including Python code, probabilistic inference algorithms, and probabilistic logic programming. We conclude by evaluating our methods on BLInD and on an adaptation of a causal reasoning question-answering dataset, which further shows their practical effectiveness.
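As a rough illustration of what mapping such a problem to Python code could look like, the sketch below encodes a two-variable textual problem and answers a conditional query by enumerating all possible worlds. The scenario ("it rains with probability 0.2; the grass is wet with probability 0.9 if it rains and 0.1 otherwise"), the variable names, and the helpers `joint` and `conditional` are hypothetical and chosen only for illustration; they are not taken from BLInD or from the paper's actual method.

```python
from itertools import product

# Hypothetical problem text (not from BLInD):
#   "It rains with probability 0.2. If it rains, the grass is wet with
#    probability 0.9; otherwise it is wet with probability 0.1."
VARIABLES = ("rain", "wet")
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}


def joint(world):
    """Joint probability of one full assignment, P(rain) * P(wet | rain)."""
    p_rain = P_RAIN if world["rain"] else 1.0 - P_RAIN
    p_wet_true = P_WET_GIVEN_RAIN[world["rain"]]
    p_wet = p_wet_true if world["wet"] else 1.0 - p_wet_true
    return p_rain * p_wet


def conditional(query, evidence):
    """P(query | evidence) by brute-force enumeration of all worlds."""
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        # Skip worlds that contradict the evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(world)
        denominator += p
        if all(world[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


# Query: "What is the probability that it rained, given that the grass is wet?"
posterior = conditional({"rain": True}, {"wet": True})
print(round(posterior, 4))
```

Enumeration is exponential in the number of variables, so it is only reasonable for very small problems; it is used here because it makes the arithmetic explicit rather than leaving it to the model's free-form reasoning.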