We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, which requires look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23\% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
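To make the task setup concrete, the following is a minimal sketch of a single game on a toy link graph. Everything in it is an illustrative assumption rather than the benchmark's actual code: the `TOY_LINKS` graph, the `shortest_path_length` and `play_game` helpers, the step budget, and the two stand-in policies (which take the place of a model) are all hypothetical. It shows step-by-step navigation under a step budget, a breadth-first reference for the optimal number of clicks, and a count of revisited pages as a rough proxy for looping.

```python
from collections import deque

# Hypothetical toy link graph: page title -> titles of pages it links to.
TOY_LINKS = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Ethiopia": ["Coffee", "Africa"],
    "Caffeine": ["Coffee", "Chemistry"],
    "Africa": ["Nile", "Ethiopia"],
    "Chemistry": ["Caffeine"],
    "Nile": [],
}


def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def play_game(links, source, target, policy, max_steps=6):
    """Run one game: the policy picks one outgoing link per step."""
    path = [source]
    revisits = 0  # number of moves onto an already-visited page
    while path[-1] != target and len(path) - 1 < max_steps:
        options = links.get(path[-1], [])
        if not options:
            break  # dead end: no outgoing links
        choice = policy(path[-1], target, options, path)
        if choice not in options:
            break  # invalid move ends the game
        if choice in path:
            revisits += 1
        path.append(choice)
    return {
        "success": path[-1] == target,
        "steps": len(path) - 1,
        "revisits": revisits,
        "path": path,
    }


def first_link_policy(current, target, options, path):
    """Stand-in for a model with no planning: always click the first link."""
    return options[0]


def avoid_revisit_policy(current, target, options, path):
    """Stand-in that prefers the first link not yet visited."""
    for option in options:
        if option not in path:
            return option
    return options[0]


if __name__ == "__main__":
    print("optimal clicks:", shortest_path_length(TOY_LINKS, "Coffee", "Nile"))
    print(play_game(TOY_LINKS, "Coffee", "Nile", first_link_policy))
    print(play_game(TOY_LINKS, "Coffee", "Nile", avoid_revisit_policy))
```

On this toy graph, the first-link policy bounces between two pages until the step budget runs out, while the revisit-avoiding policy reaches the target in the optimal number of clicks. Comparing `steps` against the breadth-first optimum is one simple way to quantify navigation efficiency; the benchmark's own metrics may be defined differently.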