Despite the effectiveness of Cube-and-Conquer (C&C) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.
翻译:尽管“立方与征服”(Cube-and-Conquer,C&C)策略在解决具有挑战性的布尔可满足性(SAT)问题中表现出色,但此前尚无研究证明基于Transformer的模型能学习到有效的立方法则。我们为此任务提出了一种神经符号后训练框架。我们设计了一条基于蒙特卡洛树搜索(MCTS)的数据筛选流水线,利用符号式启发法探索SAT竞赛公式中的分裂决策,生成基于求解器统计数据的偏好数据,并通过教师模型的推理轨迹进行增强。我们的两阶段后训练——监督微调(SFT)后接直接偏好优化(DPO)——使一个4B参数模型在100个SAT竞赛基准测试中达到53的pass@5得分,超越前沿大语言模型如Claude-Sonnet-4(50分),并与最佳符号式启发法持平(53分)。消融实验表明,仅SFT便可将pass@5从46提升至51,而DPO额外增加2个基准测试;基于实际首次立方决策的熵/一致性消融实验进一步揭示,根级决策多样性源自SFT而非DPO,这种多样性对确定性符号方法产生了互补的逐轮覆盖效果。这证明Transformer能够在传统由符号方法主导的领域中通过训练做出有效的立方法则决策。