Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often causes costs to spiral. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore cost and collapse the structure of uncertainty, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty as a structured, tradable asset (perceptual, semantic, and inferential) and enforces profitability-driven trading among agents according to rational economic rules. A market-aware broker, which extends Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by more than 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.
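To make the idea of a cost-aware broker built on Thompson Sampling more concrete, the following is a minimal sketch and not Agora's actual broker. It assumes a plain Beta-Bernoulli model of each agent's success rate and a fixed per-call cost; the class name `CostAwareThompsonBroker`, the `reward_value` parameter, and the agent names are hypothetical and chosen only for illustration. The structured uncertainty types and the trading rules described in the abstract are not modeled here.

```python
import random


class CostAwareThompsonBroker:
    """Illustrative Beta-Bernoulli Thompson Sampling with a per-agent cost penalty.

    Assumption: each agent has an unknown success probability and a known,
    fixed cost per call. This is a sketch, not the broker from the paper.
    """

    def __init__(self, agent_costs, reward_value=1.0, seed=None):
        self.costs = dict(agent_costs)          # agent name -> cost per call
        self.reward_value = reward_value        # value of one correct answer
        # Beta(1, 1) prior (uniform) over each agent's success probability.
        self.alpha = {name: 1.0 for name in self.costs}
        self.beta = {name: 1.0 for name in self.costs}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per agent and pick the highest sampled profit."""
        def sampled_profit(name):
            p = self.rng.betavariate(self.alpha[name], self.beta[name])
            return p * self.reward_value - self.costs[name]

        return max(self.costs, key=sampled_profit)

    def update(self, name, success):
        """Update the chosen agent's posterior with an observed outcome."""
        if success:
            self.alpha[name] += 1.0
        else:
            self.beta[name] += 1.0


if __name__ == "__main__":
    # Hypothetical agents: a cheap, weaker model and a costly, stronger one.
    true_success = {"small_vlm": 0.6, "large_vlm": 0.9}
    broker = CostAwareThompsonBroker(
        {"small_vlm": 0.05, "large_vlm": 0.5}, seed=0
    )
    env = random.Random(1)
    for _ in range(200):
        agent = broker.select()
        broker.update(agent, env.random() < true_success[agent])
    print({name: (broker.alpha[name], broker.beta[name]) for name in broker.costs})
```

Subtracting cost from the sampled reward is one simple way to let the posterior drive exploration while still penalizing expensive agents. In this sketch, an agent whose cost exceeds `reward_value` can never be selected over a free one, because its sampled profit is always negative.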