Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review of a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible: its FSM complexity can scale with future model capabilities.