Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns and creates a setting well suited to studying how models analyze an unfolding crisis under the fog of war. To our knowledge, it provides the first temporally grounded analysis of large language model (LLM) reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art LLMs often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Third, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.