Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
翻译:大语言模型(LLMs)被越来越多地用于隐式可归结为布尔可满足性(SAT)的任务,但它们在SAT问题上的推理能力仍不明确。我们对LLMs在2-SAT和3-SAT问题上的表现进行了系统性研究,同时结合两种经典归约问题——顶点覆盖和离散三维装箱,以探究其表示无关的推理能力。我们首先采用传统指标(包括准确率、精确率、召回率和F1分数)以及SAT相变设置进行模型评估。结果发现这些指标可能具有误导性:许多模型通过过度预测可满足公式获得高分,却无法复现3-SAT阈值附近经典的易-难-易特征,且随着变量数量增加性能急剧下降。针对该问题,我们提出了一种基于最小差异可满足与不可满足实例的配对公式协议,并引入精确区分率(ADR)指标,该指标要求每对实例的两个成员均被正确分类。ADR能有效区分推理导向型模型与启发式模型,且与证据有效性相关。在CNF之外,我们通过将CNF归约为顶点覆盖、将3-SAT归约为离散三维装箱来测试跨表示一致性。多数模型在超过80%的实例上对CNF及其对应图或装箱实例的决策结果一致,表明存在跨表示的稳定决策规则。总体而言,我们的研究结果表明SAT是LLM推理的保守探针,而基于ADR的配对评估相比传统指标能够提供更可靠且对表示形式更鲁棒的评价。