Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. The scenarios are organized into 10 topologies, distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Even so, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9\% (GPT-5.5: 62.0\%) and an average overall leakage rate of 246.5\% (GPT-5.5: 49.1\%); the leakage rate is a per-scenario count of information-leak events rather than a proportion, which is why it can exceed 100\%. These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.