How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations (analyst, coder, tester, and debugger pipelines), yet it is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks, giving 6 × 2 × 164 = 1,968 paired observations, using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus test, Wilcoxon signed-rank post-hoc tests with Holm correction, and Kendall's $W$ and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two internally indistinguishable complexity clusters separated by a 50-130% gap, and the partition is the same in both models and under both conditions. Among the architectural layers, the analyst-coder split inflates complexity; the runtime debugger does not, and when added on top of the analyst-coder split it actively deflates complexity; the tester re-inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.
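The effect-size and multiplicity steps named above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the helper names (`holm_adjust`, `average_ranks`, `rank_biserial`, `kendalls_w`) are hypothetical, the inputs in the usage note are toy numbers rather than study data, and Kendall's $W$ is computed without a tie correction. The Friedman and Wilcoxon tests themselves are omitted; in practice they would typically come from a statistics library.

```python
# Design constants taken from the abstract: 6 architectures x 2 models x 164 tasks.
ARCHITECTURES = ["Basic", "AC", "ACT", "Debugger", "AC+Debugger", "ACT+Debugger"]
N_MODELS, N_TASKS = 2, 164
N_OBSERVATIONS = len(ARCHITECTURES) * N_MODELS * N_TASKS  # 1968


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation, (W+ - W-) / (W+ + W-).

    Zero differences are dropped, as in the usual Wilcoxon signed-rank setup.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_plus - w_minus) / (w_plus + w_minus)


def kendalls_w(table):
    """Kendall's W for a table with one row per task and one column per
    architecture (assumption: no tie correction is applied)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    mean_rank_sum = n * (k + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (n ** 2 * (k ** 3 - k))
```

As a toy usage example, `rank_biserial([3, 5, 4], [1, 2, 2])` is 1.0 because every paired difference is positive, and `kendalls_w([[1, 2, 3], [2, 4, 6], [10, 20, 30]])` is 1.0 because every row ranks the columns identically. With six architectures there are 15 pairwise comparisons per metric, which is the family size a Holm correction of the post-hoc tests would plausibly use.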
