Restart policies are an important technique used in modern Conflict-Driven Clause Learning (CDCL) solvers, wherein parts of the solver state are erased at certain intervals during the run of the solver. In most solvers, variable activities are preserved across restart boundaries, so the solver tends to continue searching parts of the assignment tree close to the part it was exploring immediately prior to the restart. To enable the solver to search possibly "distant" parts of the assignment tree, we study the effect of resets, a variant of restarts that not only erases the assignment trail but also randomizes the activity scores of the variables of the input formula upon reset, thus potentially enabling better global exploration of the search space. In this paper, we model the problem of whether to trigger a reset as a multi-armed bandit (MAB) problem, and we propose two reinforcement learning (RL) based adaptive reset policies using the Upper Confidence Bound (UCB) and Thompson sampling algorithms. These algorithms balance the exploration-exploitation tradeoff by adaptively choosing arms (reset vs. no reset) based on their estimated rewards during the solver's run. We implement our reset policies in four state-of-the-art (SOTA) baseline CDCL solvers and compare the baselines against the reset versions on Satcoin benchmarks and SAT Competition instances. Our results show that the RL-based reset versions outperform the corresponding baseline solvers on both Satcoin and SAT Competition instances, suggesting that our RL policies dynamically and profitably adapt the reset frequency to any given input instance. We also introduce the concept of a partial reset, in which at least a constant number of variable activities are retained across reset boundaries. Building on previous results, we show that there is an exponential separation between $O(1)$- and $\Omega(n)$-length partial resets.
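To make the bandit formulation concrete, the following is a minimal sketch of a two-armed UCB1 policy deciding, at each restart point, between a plain restart and a full reset. The reward signal here (a random placeholder) and all names are hypothetical illustrations, not the paper's actual implementation; the real policy would derive rewards from solver statistics observed between decisions.

```python
import math
import random

# The two arms of the MAB: keep activities (plain restart) vs. randomize them (reset).
ARMS = ["no_reset", "reset"]

class UCB1:
    """Standard UCB1 bandit over a fixed set of arms."""

    def __init__(self, n_arms: int, c: float = 2.0):
        self.counts = [0] * n_arms    # number of times each arm was pulled
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.c = c                    # exploration constant

    def select(self) -> int:
        # Pull each arm once before applying the UCB index.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        # UCB1 index: mean(a) + sqrt(c * ln(t) / n(a)).
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
            + math.sqrt(self.c * math.log(total) / self.counts[a]),
        )

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical solver loop: consult the bandit at every restart point.
bandit = UCB1(len(ARMS))
for restart_point in range(1000):
    arm = bandit.select()
    if ARMS[arm] == "reset":
        pass  # erase the trail AND randomize variable activity scores
    else:
        pass  # ordinary restart: erase the trail, keep activities
    reward = random.random()  # placeholder for a solver-derived reward
    bandit.update(arm, reward)
```

Thompson sampling would replace the UCB index with per-arm posterior sampling (e.g., Beta posteriors over binary rewards), but the surrounding decision loop stays the same.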