Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable: no algorithm can determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination and sometimes fail to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. The recent success of and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination: GPT-5 and Claude Sonnet-4.5 (using test-time scaling) would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.