We investigate the probabilistic reasoning capabilities of large language models (LLMs) through a controlled benchmarking study on discrete probability problems. We construct two datasets, one of standard exercises and one of counterintuitive exercises designed to trigger heuristic reasoning, and evaluate 8 state-of-the-art models, each with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, and no model proves immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.