Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and skewed toward difficulty extremes, with instances that are either trivial or intractable. These limitations hinder rigorous evaluation of new verification techniques and encourage overfitting of solver heuristics to a narrow set of problems. To address this, we introduce EvolveGen, a framework for generating hardware model checking benchmarks by combining reinforcement learning (RL) with high-level synthesis (HLS). Our approach operates at an algorithmic level of abstraction, in which an RL agent learns to construct computation graphs. By compiling these graphs under different synthesis directives, we produce pairs of functionally equivalent but structurally distinct hardware designs, inducing challenging model checking instances. Solver runtime serves as the reward signal, enabling the agent to autonomously discover and generate small-but-hard instances that expose solver-specific weaknesses. Experiments show that EvolveGen efficiently creates a diverse benchmark set in standard formats (e.g., AIGER and BTOR2) and effectively reveals performance bottlenecks in state-of-the-art model checkers.
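The core loop described above, where an agent grows a computation graph and is rewarded by how long a solver takes on the resulting equivalence-checking instance, can be sketched as follows. This is a minimal illustration only: all function names are hypothetical, HLS compilation and the model checker are replaced by stubs, and a greedy hill-climber stands in for the RL agent.

```python
import random

# Illustrative sketch of EvolveGen's reward loop (names are hypothetical,
# not from the paper). The agent mutates a computation graph; two HLS
# directive sets yield functionally equivalent designs; the time a solver
# spends on the paired instance is the agent's reward.

def compile_with_directives(graph, directives):
    """Stub: stands in for HLS compilation to a BTOR2/AIGER design."""
    return (tuple(sorted(graph)), directives)

def solver_runtime(instance, timeout=60.0):
    """Stub: stands in for invoking a model checker and timing it.
    Here runtime is simulated as growing with instance size."""
    design, _directives = instance
    return min(timeout, 0.01 * len(design) * (1.0 + random.random()))

def reward(graph):
    # Compile the same graph under two different synthesis directives.
    d1 = compile_with_directives(graph, directives=("unroll",))
    d2 = compile_with_directives(graph, directives=("pipeline",))
    # The equivalence-checking instance pairs the two designs; a longer
    # solve time means a harder instance, hence a higher reward.
    return solver_runtime((d1[0] + d2[0], d1[1] + d2[1]))

def evolve(steps=100, seed=0):
    """Greedy stand-in for the RL agent: keep mutations that raise reward."""
    random.seed(seed)
    graph, best = [0], reward([0])
    for _ in range(steps):
        candidate = graph + [len(graph)]  # mutate: append one node
        r = reward(candidate)
        if r > best:
            graph, best = candidate, r
    return graph, best
```

In the actual framework, the stubs would be replaced by real HLS compilation and model-checker invocations, and the greedy search by a learned policy; the sketch only shows how solver runtime closes the generation loop.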