Agentic workflows, in which multiple AI agents collaborate to accomplish complex tasks such as reasoning or planning, play a substantial role in many cutting-edge commercial applications and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans had been trusted to do. These workflows depend critically on the prompts that define the roles the models play within them. Prompts that even slightly misguide individual agents can lead to sub-optimal performance that snowballs through a system of agents, limiting its reliability and scalability. To address this problem of inference-time prompt optimization, we introduce ProRefine, a method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground-truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts, highlighting its potential for building more cost-effective and powerful hybrid AI systems and thereby democratizing access to high-performing AI.
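The core idea can be pictured as an inference-time loop in which one model attempts the task, a second produces textual feedback on the attempt, and a third rewrites the prompt accordingly. The sketch below is a minimal illustration under those assumptions; the names `refine_prompt`, `actor`, `critic`, and `refiner`, as well as the stopping rule, are illustrative and not the paper's exact algorithm or interface.

```python
# Minimal sketch of an inference-time prompt-refinement loop in the spirit of
# ProRefine. The roles, prompts, and stopping rule are assumptions for
# illustration only.
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model call


def refine_prompt(task: str, prompt: str, actor: LLM, critic: LLM,
                  refiner: LLM, max_rounds: int = 3) -> str:
    """Iteratively improve `prompt` from textual feedback, with no training or labels."""
    for _ in range(max_rounds):
        # Run the current prompt on the task.
        answer = actor(f"{prompt}\n\nTask: {task}")
        # Ask a critic model for textual feedback on the reasoning.
        feedback = critic(
            f"Task: {task}\nPrompt: {prompt}\nAnswer: {answer}\n"
            "Point out flaws in the reasoning, or reply DONE if there are none."
        )
        if "DONE" in feedback:
            break
        # Rewrite the prompt so the next attempt addresses the feedback.
        prompt = refiner(
            f"Original prompt: {prompt}\nFeedback: {feedback}\n"
            "Rewrite the prompt so the next attempt avoids these flaws."
        )
    return prompt
```

In use, `actor`, `critic`, and `refiner` would simply be calls to LLMs (possibly of different sizes), which is consistent with the abstract's observation that feedback-driven refinement lets smaller models approach the performance of larger ones.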