Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores and offering little insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinement, which is difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Unlike prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework consists of two stages: in the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; in the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.
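To make the two-phase flow concrete, the sketch below outlines one possible shape of the pipeline in Python. It is purely illustrative: every name in it (ReasoningAsset, AssetStore, reasoning_phase, test_phase, and the stubbed agent logic standing in for LLM calls) is an assumption for exposition, not the paper's actual interface.

```python
# A minimal sketch of the two-phase flow described above. The agent
# functions are trivial stubs standing in for LLM calls; all names and
# signatures are illustrative assumptions, not the paper's API.

from dataclasses import dataclass, field


@dataclass
class ReasoningAsset:
    explanation: str  # why the metric scores came out as they did
    diagnosis: str    # the weakness identified in the prompt
    refinement: str   # a targeted, evidence-grounded edit


@dataclass
class AssetStore:
    assets: list[ReasoningAsset] = field(default_factory=list)

    def add(self, asset: ReasoningAsset) -> None:
        self.assets.append(asset)

    def retrieve(self, prompt: str, k: int = 3) -> list[ReasoningAsset]:
        # Stub retrieval; a real system would match assets to the prompt,
        # e.g. by semantic similarity.
        return self.assets[:k]


def reasoning_phase(prompt: str, scores: dict[str, float], store: AssetStore) -> None:
    """Reasoning Phase: explain scores, diagnose weaknesses, store assets."""
    weakest = min(scores, key=scores.get)  # stand-in for an agent's score analysis
    explanation = f"'{weakest}' scored lowest ({scores[weakest]:.2f})."
    diagnosis = f"Prompt under-specifies the '{weakest}' criterion."
    refinement = f"Add an explicit instruction addressing '{weakest}'."
    store.add(ReasoningAsset(explanation, diagnosis, refinement))


def test_phase(prompt: str, store: AssetStore) -> str:
    """Test Phase: retrieve stored assets and apply only grounded edits."""
    for asset in store.retrieve(prompt):
        prompt += f"\n# refinement: {asset.refinement}"  # stand-in for an agent edit
    return prompt


store = AssetStore()
reasoning_phase("Summarize the report.", {"helpfulness": 0.4, "coherence": 0.9}, store)
print(test_phase("Summarize the quarterly report.", store))
```

The key design point the sketch tries to capture is the separation of concerns: evaluation signals are converted into persistent, inspectable reasoning assets once, and test-time editing is restricted to consuming those assets, which is what makes the refinements auditable.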