Recent work shows superior performance when using large language models (LLMs) as formalizers instead of as end-to-end solvers for symbolic reasoning problems. Given the problem description, the LLM generates a formal program that derives a solution via an external solver. We systematically investigate the formalization capability of LLMs on real-life constraint satisfaction problems on 4 benchmarks, 6 LLMs, and 2 types of formal languages. We show that LLM-as-formalizer by no means trivializes the problem but underperforms LLM-as-solver in 15 out of 24 model-dataset combinations, despite the former's verifiability and interpretability. Although the formalization space is magnitudes smaller than the search space, our scaling analysis shows that LLM-as-formalizer still drastically degrades as problem complexity increases similar to LLM-as-solver. To better understand this limitation, we observe excessive, solver-like reasoning tokens that sometimes lead to hard-coded solutions, highlighting a key challenge for improving LLM-based formalization.
翻译:近期研究表明,在符号推理问题中,使用大型语言模型(LLM)作为形式化工具而非端到端求解器时,性能表现更为优越。给定问题描述后,LLM生成的形式化程序可通过外部求解器导出解决方案。我们针对4个基准、6种LLM和2种形式语言,系统研究了LLM在现实约束满足问题中的形式化能力。研究表明,LLM-形式化工具并未使问题变得简单,相反在24个模型-数据集组合中有15个的表现不及LLM-求解器,尽管前者具备可验证性和可解释性。尽管形式化空间远小于搜索空间,但我们的规模分析显示,随着问题复杂度增加,LLM-形式化工具的性能退化程度与LLM-求解器类似。为了深入理解这一局限,我们观察到大量类似求解器的推理标记,这些标记有时会导致硬编码解决方案,凸显了改进基于LLM形式化方法的关键挑战。