Multi-agent systems built on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives their performance or for evaluating performance claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy covering architecture pattern, coordination mechanism, memory architecture, and tool integration, and apply it to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): the design of inter-agent coordination protocols is a primary driver of trading decision quality, often exerting greater influence than model scaling. We present CPH as a falsifiable research hypothesis supported by tiered structural evidence, not as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that the field does not yet have. Third, we document five pervasive evaluation failures (look-ahead bias, survivorship bias, backtest overfitting, transaction cost neglect, and regime-shift blindness) and show that they can reverse the sign of reported returns. Building on the CPH and this evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi-agent coordination adds genuine value net of transaction costs, and we propose minimum evaluation standards as prerequisites for validating the CPH.