The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Inference-time scaling of multi-agent LLM systems lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$, where the regime exponent $\beta$ classifies any configuration into one of three asymptotic regimes for $N_\text{eff}$: hard-ceiling at $1/c$ ($\beta = 0$), sublinear growth as $N^\beta/c$ ($0 < \beta < 1$), or linear ($\beta \ge 1$). A mean-field theorem further predicts that peer count $k$ and rounds $\tau$ during agent debate enter the dynamics only through their product $k\tau$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \beta)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear to hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty agents in dense debate produce no more answer diversity than a single agent on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling. Within the configurations tested, only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
