An emerging line of work advocates using large language models (LLMs) as formalizers rather than as end-to-end solvers. Instead of generating the solution directly, the LLM generates a formal program from which an external solver derives the solution. We thoroughly investigate the formalization capability of LLMs on real-life constraint satisfaction problems. Across 4 domains, we systematically evaluate 6 LLMs, including 4 large reasoning models with inference-time scaling, paired with 5 pipelines spanning 2 types of formalism. We show that in zero-shot settings, LLM-as-formalizer performs on par with the mainstream LLM-as-solver while offering verifiability, interpretability, and robustness. We also observe excessive reasoning tokens and hard-coded solutions that scale with problem complexity, demonstrating that even state-of-the-art LLMs have limited ability to generate solutions or formal programs. We present detailed analysis and actionable remedies to drive future research on improving LLM-as-formalizer.