Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs performance drops as program length and complexity increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
翻译:判断一个程序是否会终止是计算机科学中的核心问题。图灵的基础性结论确立了“停机问题”的不可判定性,表明不存在能够普遍判定所有程序和输入是否终止的算法。因此,自动验证工具只能近似地处理终止问题,有时无法证明或否证;这些工具依赖于特定问题的架构,并且通常与特定编程语言绑定。大型语言模型的近期成功与进展提出了以下问题:LLMs 能否可靠地预测程序终止?在本工作中,我们在国际软件验证竞赛 2025 年“终止”类别中的多样化程序集上评估了 LLMs。我们的结果表明,LLMs 在预测程序终止方面表现异常出色,其中 GPT-5 和 Claude Sonnet-4.5 的成绩将仅次于排名第一的工具(采用测试时扩展策略),而 Code World Model (CWM) 将紧随排名第二的工具之后。尽管 LLMs 在预测程序终止方面有效,但它们往往无法提供有效的验证作为证明。此外,随着程序长度和复杂度增加,LLMs 的性能会下降。我们期望这些见解能激励对程序终止问题以及 LLMs 在推理不可判定问题方面更广泛潜力的进一步研究。