Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.
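As an illustration of the kind of semantic check the benchmark calls for (a minimal sketch, not the paper's evaluation harness; state names and the example language are our own), a candidate DFA can be simulated directly and its verdicts compared against the intended regular language. The language (ab)* exercises exactly the Kleene-star semantics the abstract identifies as a common failure point:

```python
# Minimal DFA simulator (illustrative sketch; not the paper's harness).
# DFA for the language (ab)* over alphabet {a, b}: zero or more
# repetitions of "ab". Missing transitions go to an implicit dead state.

DEAD = "dead"
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q0",
}
START, ACCEPT = "q0", {"q0"}

def accepts(s: str) -> bool:
    """Run the DFA on s; the dead state rejects every continuation."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch), DEAD)
        if state == DEAD:
            return False
    return state in ACCEPT

# Verdicts are checked against membership in the intended language:
assert accepts("")           # the empty string is in (ab)*
assert accepts("abab")
assert not accepts("aba")    # dangling "a" is not in the language
assert not accepts("ba")
```

A benchmark checker built this way can detect semantic errors (wrong verdicts) that a purely syntactic inspection of the transition table would miss.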