Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.