We propose agentic automata learning to evaluate how well tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent must identify a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines in the form of classic automata-learning algorithms. Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms on this task.
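To make the interaction protocol concrete, the following is a minimal sketch of an oracle that answers the two query types over a hidden DFA. It is an illustration under stated assumptions, not the implementation used in this work: the names `DFA`, `Oracle`, `membership_query`, and `equivalence_query` are hypothetical, and the choice to return a shortest counterexample on a failed equivalence query follows the usual convention in automata learning rather than anything specified above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    """A complete DFA; symbols are assumed to be single-character strings."""
    alphabet: tuple
    transitions: dict      # (state, symbol) -> state, assumed total
    start: int
    accepting: frozenset

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle wrapping the hidden target DFA and counting queries."""

    def __init__(self, target):
        self._target = target
        self.membership_count = 0
        self.equivalence_count = 0

    def membership_query(self, word):
        """Does this string belong to the target language?"""
        self.membership_count += 1
        return self._target.accepts(word)

    def equivalence_query(self, hypothesis):
        """Is this the target DFA? Returns None if the languages match,
        otherwise a shortest string on which the two automata disagree
        (assumed behaviour; the hypothesis must use the same alphabet)."""
        self.equivalence_count += 1
        start = (self._target.start, hypothesis.start)
        queue = deque([(start, "")])
        seen = {start}
        # Breadth-first search over the product automaton.
        while queue:
            (t, h), word = queue.popleft()
            if (t in self._target.accepting) != (h in hypothesis.accepting):
                return word
            for symbol in self._target.alphabet:
                nxt = (self._target.transitions[(t, symbol)],
                       hypothesis.transitions[(h, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Toy target: strings over {a, b} with an even number of 'a's.
target = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
    start=0,
    accepting=frozenset({0}),
)
# A deliberately wrong hypothesis that accepts every string.
accept_all = DFA(
    alphabet=("a", "b"),
    transitions={(0, "a"): 0, (0, "b"): 0},
    start=0,
    accepting=frozenset({0}),
)
oracle = Oracle(target)
```

In this sketch, the query counters stand in for the interaction-efficiency measurements mentioned above: an agent (or a classic learning algorithm used as a baseline) would call the two methods as tools, and the number of calls needed before `equivalence_query` returns `None` gives one possible cost measure.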