As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.