Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

from arxiv, 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This is the version where we already addressed most of the reviews from the COLM reviewers

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

翻译：大型语言模型（LLMs）在推理任务中表现出色，但其表现究竟源于忠实的逻辑推理还是启发式近似，目前尚不明确。本研究通过在法律蕴涵任务中对比三种范式——纯LLM分类、基于LLM的形式推理以及基于Z3 SMT求解器的求解形式推理——并在重新标注的ContractNLI子集上对五种LLM进行了测试。重新标注揭示出实用法律解释与严格形式蕴涵之间存在系统且可量化的鸿沟：大量在法律上合理的推理若不引入额外未明示假设，则缺乏形式化基础。尽管引入形式化结构能提升准确率（基于LLM的形式推理达到了最高的基准性能），但我们证明这种提升并不等同于忠实推理。我们识别出三种反复出现的失败模式：范围清洗（LLM在未执行底层形式推理的情况下报告与求解器不一致的分类结果，从而生成看似逻辑严谨实则不然的结论）；隐含约束盲视（LLM忽视形式化表示中存在的逻辑约束）；以及程序综合失败（LLM即便在结构化提示下仍生成错误的Z3代码）。尤为关键的是，范围清洗现象在所有模型中持续存在，这严重质疑了基于LLM的形式推理作为符号执行代理的忠实性。这些结果表明，基准准确率与逻辑忠实性之间存在根本性脱节。