Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box: they rely solely on outcome scores and do not explain why a prompt succeeds or fails. Their refinements also proceed by repeated trial and error whose rationale remains implicit, which limits interpretability and gives little actionable guidance for systematic improvement. In this paper, we propose MA-SAPO (Multi-Agent Reasoning for Score Aware Prompt Optimization), a framework that links evaluation outcomes directly to targeted refinements. In the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves the exemplars and assets relevant to a new prompt, and a refiner agent applies evidence-based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA-SAPO ensures that edits are interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single-pass prompting, retrieval-augmented generation, and prior multi-agent methods across multiple evaluation metrics.
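To make the two-phase flow concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that the LLM agents can be stood in for by simple rules: a fixed score threshold marks a metric as weak, a hand-written table maps weak metrics to revision directives, and retrieval uses token-overlap (Jaccard) similarity. All names (`ReasoningAsset`, `build_asset`, `retrieve`, `refine`, `DIRECTIVE_FOR`), the metric names, the threshold, and the example prompts and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReasoningAsset:
    """One stored training-phase record (hypothetical structure)."""
    prompt: str
    scores: dict
    weaknesses: list
    directives: list


# Hypothetical rule table standing in for the LLM agents that would
# diagnose weaknesses and write revision directives.
DIRECTIVE_FOR = {
    "helpfulness": "State the user's goal and the desired output format.",
    "correctness": "Ask for verifiable facts and for uncertainty to be flagged.",
    "coherence": "Request a clearly ordered, step-by-step answer.",
}


def build_asset(prompt, scores, threshold=3):
    """Training phase: interpret scores, diagnose, emit directives."""
    # Assumption: a metric scoring below `threshold` counts as a weakness.
    weaknesses = [m for m, s in scores.items()
                  if s < threshold and m in DIRECTIVE_FOR]
    directives = [DIRECTIVE_FOR[m] for m in weaknesses]
    return ReasoningAsset(prompt, scores, weaknesses, directives)


def _tokens(text):
    return set(text.lower().split())


def retrieve(query, assets, k=1):
    """Test phase (analyzer): return the k assets most similar to the query."""
    q = _tokens(query)

    def jaccard(asset):
        t = _tokens(asset.prompt)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(assets, key=jaccard, reverse=True)[:k]


def refine(prompt, retrieved):
    """Test phase (refiner): append the retrieved directives to the prompt."""
    directives = []
    for asset in retrieved:
        for d in asset.directives:
            if d not in directives:  # keep order, drop duplicates
                directives.append(d)
    if not directives:
        return prompt
    return prompt + "\n" + "\n".join(f"- {d}" for d in directives)


# Toy data; prompts and scores are invented for illustration only.
assets = [
    build_asset("Explain how photosynthesis works",
                {"helpfulness": 4, "correctness": 2, "coherence": 4}),
    build_asset("Write a poem about the sea",
                {"helpfulness": 1, "correctness": 4, "coherence": 4}),
]
new_prompt = "Explain how vaccines work"
refined = refine(new_prompt, retrieve(new_prompt, assets, k=1))
print(refined)
```

Because every edit in `refined` traces back to a stored asset with its scores and diagnosed weaknesses, the sketch illustrates the sense in which such edits can be audited. In the framework described above, each rule-based step would be performed by an agent instead.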