We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies $(\rho=0.360, p=0.0265)$, while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.