Large Language Models (LLMs) have recently empowered agentic frameworks to exhibit advanced reasoning and planning capabilities. However, their integration into robotic control pipelines remains limited in two respects: (1) prior LLM-based approaches often lack modular, agentic execution mechanisms, limiting their ability to plan, reflect on outcomes, and revise actions in a closed-loop manner; and (2) existing benchmarks for manipulation tasks focus on low-level control and do not systematically evaluate multistep reasoning and linguistic variation. In this paper, we propose Agentic LLM for Robot Manipulation (ALRM), an LLM-driven agentic framework for robotic manipulation. ALRM integrates policy generation with agentic execution through a ReAct-style reasoning loop, supporting two complementary modes: Code-as-Policy (CaP) for direct generation of executable control code, and Tool-as-Policy (TaP) for iterative planning and tool-based action execution. To enable systematic evaluation, we also introduce a novel simulation benchmark comprising 56 tasks across multiple environments, capturing linguistically diverse instructions. Experiments with ten LLMs demonstrate that ALRM provides a scalable, interpretable, and modular approach for bridging natural language reasoning with reliable robotic execution. Results reveal Claude-4.1-Opus as the top closed-source model and Falcon-H1-7B as the top open-source model under CaP.
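The closed-loop behaviour described for the TaP mode (act, observe the outcome, revise) can be illustrated with a minimal sketch. This is not ALRM's implementation: the toy world, the `pick` and `place` tools, `scripted_policy` (a stand-in for an LLM call), and `run_tap_loop` are all hypothetical names introduced here for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy world: object name -> location. Not taken from the paper.
World = Dict[str, str]
Action = Tuple[str, str]  # (tool name, argument); ("done", "") ends the loop


def pick(world: World, obj: str) -> str:
    """Toy tool: move an object from the table into the gripper."""
    if world.get(obj) == "table":
        world[obj] = "gripper"
        return f"picked {obj}"
    return f"error: {obj} not on table"


def place(world: World, obj: str) -> str:
    """Toy tool: move an object from the gripper into the bin."""
    if world.get(obj) == "gripper":
        world[obj] = "bin"
        return f"placed {obj}"
    return f"error: {obj} not in gripper"


TOOLS: Dict[str, Callable[[World, str], str]] = {"pick": pick, "place": place}


def scripted_policy(history: List[str]) -> Action:
    """Stand-in for an LLM: chooses the next action from past observations."""
    if not history:
        return ("place", "cube")  # deliberately wrong first step
    last = history[-1]
    if last.startswith("error"):
        return ("pick", "cube")   # revise the plan after observing a failure
    if last.startswith("picked"):
        return ("place", "cube")
    return ("done", "")


def run_tap_loop(
    policy: Callable[[List[str]], Action], world: World, max_steps: int = 8
) -> List[str]:
    """Alternate between choosing a tool call and recording its observation."""
    history: List[str] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "done":
            break
        history.append(TOOLS[tool](world, arg))
    return history


if __name__ == "__main__":
    world = {"cube": "table"}
    for observation in run_tap_loop(scripted_policy, world):
        print(observation)
```

In this sketch the first action fails, the failure is fed back as an observation, and the policy recovers on the next step. A CaP-style mode would presumably differ in that the model emits a complete control program in one pass rather than one tool call per step.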