Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. Using causal discovery as a testbed, we evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce predicted ranges far wider than the true confidence intervals, yet these ranges still fail to contain the true algorithmic mean in the majority of instances; most models perform worse than random guessing, and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue that it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.