Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable at structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language and the determinism of programming via an agentic language with explicit control constructs (e.g., \texttt{IF}, \texttt{GOTO}, \texttt{FORALL}). Beyond the syntactic and semantic verification of each step's output, which is performed against that step's specific instruction, RunAgent autonomously derives and validates constraints from the task description and the task instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool use, and code generation and execution (e.g., in Python), and incorporates error-correction mechanisms to ensure correctness. Finally, RunAgent filters the context history, retaining only the information relevant to the step being executed. Evaluations on the Natural-plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
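To make the idea of stepwise execution with explicit control flow and per-step validation concrete, the following is a minimal sketch and not RunAgent's implementation. It assumes that each step is a plain Python callable standing in for an LLM, tool, or code call, that a constraint is a boolean check on the step output, and that an \texttt{IF} \ldots \texttt{GOTO} construct can be modeled as an optional jump target. All names (`Step`, `run_plan`, `goto_if`, `max_retries`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    # All field names are hypothetical, chosen for illustration only.
    action: Callable[[dict], object]   # computes the step output from the state
    check: Callable[[object], bool]    # constraint the output must satisfy
    key: str                           # state entry that receives the output
    # Models "IF <condition> GOTO <index>": returns a step index or None.
    goto_if: Optional[Callable[[dict], Optional[int]]] = None


def run_plan(steps, state, max_retries=2, max_steps=100):
    """Execute steps in order, validating each output before moving on."""
    pc = 0          # index of the step to execute next
    executed = 0    # guard against plans that loop forever
    while pc < len(steps):
        executed += 1
        if executed > max_steps:
            raise RuntimeError("step budget exceeded")
        step = steps[pc]
        # Simple error correction: re-run the step until its check passes.
        for _ in range(max_retries + 1):
            out = step.action(state)
            if step.check(out):
                break
        else:
            raise ValueError(f"step {pc} failed validation")
        state[step.key] = out
        target = step.goto_if(state) if step.goto_if else None
        pc = target if target is not None else pc + 1
    return state


# A two-step plan that sums a list by looping with a conditional jump.
plan = [
    Step(action=lambda s: s["total"] + s["items"][s["i"]],
         check=lambda out: isinstance(out, int),
         key="total"),
    Step(action=lambda s: s["i"] + 1,
         check=lambda out: out >= 1,
         key="i",
         goto_if=lambda s: 0 if s["i"] < len(s["items"]) else None),
]

result = run_plan(plan, {"items": [3, 4, 5], "total": 0, "i": 0})
print(result["total"])  # 12
```

In this sketch the checks are supplied by hand, whereas the abstract describes constraints that are derived autonomously from the task description and instance; the retry loop is only a stand-in for the error-correction mechanisms mentioned above.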