LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. A behavior analyzer attributes episode outcomes to specific prompt components, and a mutator proposes targeted prompt revisions, which are then validated through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions without any updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches a success rate of up to 72.5% using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework combined with automatic prompt optimization can improve LLM agent performance without fine-tuning or extensive human supervision.
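The optimization loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PromptPair`, `optimize_prompts`, `toy_rollout`, `toy_analyze`, `toy_mutate`, and `HINTS` are all hypothetical, the greedy accept-if-better rule is an assumption standing in for whatever selection scheme the framework actually uses, and the toy functions replace the LLM calls and BabyAI rollouts with deterministic stand-ins so the control flow can be run end to end.

```python
from dataclasses import dataclass, replace
from statistics import mean
from typing import Callable, Tuple


@dataclass(frozen=True)
class PromptPair:
    """Prompts for the two modules named in the abstract (hypothetical container)."""
    descriptor: str  # prompt for the goal-conditioned descriptor agent
    action: str      # prompt for the action selection agent


def optimize_prompts(
    initial: PromptPair,
    rollout: Callable[[PromptPair], float],
    analyze: Callable[[PromptPair, float], str],
    mutate: Callable[[PromptPair, str], PromptPair],
    generations: int = 5,
    episodes: int = 4,
) -> Tuple[PromptPair, float]:
    """Greedy loop: analyze, mutate, validate by rollouts, keep the better prompts."""
    best = initial
    best_return = mean(rollout(best) for _ in range(episodes))
    for _ in range(generations):
        # Attribute the current outcome to one prompt component.
        feedback = analyze(best, best_return)
        # Propose a targeted revision of that component.
        candidate = mutate(best, feedback)
        # Validate the revision with environment rollouts before accepting it.
        candidate_return = mean(rollout(candidate) for _ in range(episodes))
        if candidate_return > best_return:
            best, best_return = candidate, candidate_return
    return best, best_return


# Toy stand-ins (assumptions, not real agents or environments): the "environment"
# rewards each module's prompt for containing one hypothetical hint.
HINTS = {
    "descriptor": "Mention the goal object.",
    "action": "Face the object before picking it up.",
}


def toy_rollout(prompts: PromptPair) -> float:
    """Return 0.5 per module whose prompt contains its hint."""
    score = 0.0
    if HINTS["descriptor"] in prompts.descriptor:
        score += 0.5
    if HINTS["action"] in prompts.action:
        score += 0.5
    return score


def toy_analyze(prompts: PromptPair, episode_return: float) -> str:
    """Name the first module whose prompt lacks its hint, or 'none'."""
    if HINTS["descriptor"] not in prompts.descriptor:
        return "descriptor"
    if HINTS["action"] not in prompts.action:
        return "action"
    return "none"


def toy_mutate(prompts: PromptPair, module: str) -> PromptPair:
    """Append the missing hint to the module flagged by the analyzer."""
    if module == "descriptor":
        return replace(prompts, descriptor=f"{prompts.descriptor} {HINTS['descriptor']}")
    if module == "action":
        return replace(prompts, action=f"{prompts.action} {HINTS['action']}")
    return prompts


if __name__ == "__main__":
    start = PromptPair("Describe the grid.", "Pick an action.")
    final_prompts, final_return = optimize_prompts(
        start, toy_rollout, toy_analyze, toy_mutate
    )
    print(final_prompts, final_return)
```

In a real setting, `rollout` would run the two-agent pipeline in the environment and return the episode return, while `analyze` and `mutate` would each be LLM calls. Only the prompts change between generations, which is consistent with the abstract's point that no model weights are updated.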