Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination: GPT-5 and Claude Sonnet-4.5 (with test-time scaling) would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
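For concreteness, the following is a minimal sketch, not drawn from the paper's benchmark set, of the kind of termination query posed in SV-Comp's Termination category. The function name __VERIFIER_nondet_int follows the SV-Comp convention for modeling an arbitrary input; it is stubbed here with a random value so the sketch compiles and runs standalone.

```c
#include <stdlib.h>
#include <time.h>

/* SV-Comp convention: __VERIFIER_nondet_int() models an arbitrary
   input value. Stubbed here (assumption, for a runnable sketch);
   in competition benchmarks it is left undefined and the verifier
   must reason over every possible return value. */
static int __VERIFIER_nondet_int(void) {
    srand((unsigned)time(NULL));
    return rand() - RAND_MAX / 2;  /* any int, positive or negative */
}

int main(void) {
    int x = __VERIFIER_nondet_int();
    /* Terminates for every input: if x <= 0 the loop is skipped;
       if x > 0, each iteration strictly decreases x, so x serves
       as a ranking function bounded below by 0. */
    while (x > 0)
        x -= 2;
    return 0;
}
```

A standard termination witness for such a program is a ranking function (here, x itself), which is exactly the kind of proof artifact the abstract notes LLMs often fail to produce even when their termination verdict is correct.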