Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top ranked verification tools (with test time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly LLM reasoning on other undecidable problems.
翻译:判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止性的不可判定性,表明不存在能普遍判断所有程序及其输入是否终止的算法。因此,验证工具对终止性进行近似判断,有时无法证明或否证;这些工具依赖于特定问题的架构,且通常与特定编程语言绑定。LLMs的最新进展引发了一个自然问题:它们在多大程度上能够推理程序的终止性?我们在国际软件验证竞赛(SV Comp)2025的多样化C语言程序集合上评估了前沿LLMs。结果表明,GPT-5和Claude Sonnet 4.5取得了与顶级验证工具相当的分数(在测试时扩展计算下)。然而,尽管模型通常能正确推断程序是否终止,它们却频繁无法构建作为形式化证明的证据,揭示了语义识别与符号证明生成之间的差距。性能随代码长度增加而进一步下降。为分析这一差距,我们引入了一种分歧前提形式化方法,将非终止条件表述为逻辑约束。我们希望这些发现能推动未来在现实世界终止性基准、结合LLMs与符号验证方法的神经符号方法,以及更广泛地,LLMs在其他不可判定问题上的推理等方面的研究。