From Talking Words to Sharing Thoughts: Scalable Multi-LLM Aggregation via Structured Message Passing

The emergence of specialized, domain-tuned Large Language Models (LLMs) has shown that smaller models can achieve expert-level performance on specific tasks, yet they struggle in out-of-domain settings. Current ensemble methods for combining their complementary expertise rely primarily on iterative re-prompting or cross-model refinement. These approaches incur high computational cost and latency because they require repeated LLM inference calls. Moreover, naive aggregation often leads to anchor corruption, in which noise propagated from weaker models degrades the performance of the most accurate expert. To address these challenges, we propose a framework that integrates model predictions at the semantic layer using a bipartite factor graph. In this architecture, individual LLMs are represented as variable nodes, while a set of check nodes assesses their consistency against diverse epistemic criteria. We develop a message-passing protocol, inspired by error-recovery systems, that resolves disagreements iteratively. We also introduce an asymmetric damping mechanism that protects high-reliability anchor nodes from being overridden by the ensemble majority. Unlike existing methods, our approach operates on output distributions and requires no additional LLM calls during the refinement phase. Evaluated on four benchmarks (MMLU, MMLU-Pro, GPQA, and MedMCQA), our method achieves a 97% reduction in token usage and up to a 6× decrease in API calls, reducing inference time from several minutes to milliseconds while consistently outperforming leading multi-agent baselines. These results suggest that graph-based belief propagation is a robust, fast, and scalable alternative to current multi-agent LLM systems. The full pipeline and code will be made public.
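To make the architecture above more concrete, the following is a minimal sketch of damped message passing over model output distributions. It is not the authors' implementation: the abstract does not specify the check-node criteria or the update rule, so everything here is an assumption made for illustration. In particular, the check node is assumed to send each member a reliability-weighted average of the other members' beliefs, and the names `propagate`, `check_message`, `aggregate`, `damp_anchor`, `damp_other`, and the reliability weights are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def normalize(p: List[float]) -> List[float]:
    """Rescale a non-negative vector so that it sums to 1."""
    total = sum(p)
    return [x / total for x in p]


def check_message(beliefs: Dict[str, List[float]], members: List[str],
                  weights: Dict[str, float], target: str) -> List[float]:
    """Assumed check-node rule: reliability-weighted average of the
    beliefs of every member of the check except the target itself."""
    k = len(beliefs[target])
    msg = [0.0] * k
    for m in members:
        if m != target:
            for j in range(k):
                msg[j] += weights[m] * beliefs[m][j]
    return normalize(msg)


def propagate(priors: Dict[str, List[float]], checks: List[List[str]],
              weights: Dict[str, float], anchor: str, rounds: int = 5,
              damp_anchor: float = 0.1, damp_other: float = 0.6
              ) -> Dict[str, List[float]]:
    """Synchronous damped updates. The anchor takes a small step toward
    incoming messages, the other nodes a large one (asymmetric damping).
    No model is queried again: only the distributions are updated."""
    beliefs = {name: normalize(p) for name, p in priors.items()}
    for _ in range(rounds):
        updated = {}
        for name, belief in beliefs.items():
            incoming = [check_message(beliefs, c, weights, name)
                        for c in checks if name in c and len(c) > 1]
            if not incoming:  # node is attached to no usable check
                updated[name] = belief
                continue
            k = len(belief)
            avg = [sum(m[j] for m in incoming) / len(incoming) for j in range(k)]
            step = damp_anchor if name == anchor else damp_other
            updated[name] = normalize(
                [(1 - step) * belief[j] + step * avg[j] for j in range(k)])
        beliefs = updated
    return beliefs


def aggregate(beliefs: Dict[str, List[float]], weights: Dict[str, float]) -> int:
    """Final answer: index of the option with the highest weighted belief."""
    k = len(next(iter(beliefs.values())))
    scores = [sum(weights[n] * b[j] for n, b in beliefs.items()) for j in range(k)]
    return max(range(k), key=scores.__getitem__)


def majority_vote(priors: Dict[str, List[float]]) -> int:
    """Naive baseline: each model votes for its own top option."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in priors.values())
    return votes.most_common(1)[0][0]


# Toy four-option question; all numbers are made up for illustration.
priors = {
    "expert": [0.1, 0.7, 0.1, 0.1],   # assumed high-reliability anchor
    "weak_a": [0.6, 0.2, 0.1, 0.1],
    "weak_b": [0.5, 0.3, 0.1, 0.1],
}
weights = {"expert": 0.8, "weak_a": 0.4, "weak_b": 0.4}
checks = [["expert", "weak_a", "weak_b"]]  # a single check over all models

final = propagate(priors, checks, weights, anchor="expert")
print("majority vote:", majority_vote(priors))
print("damped propagation:", aggregate(final, weights))
```

In this toy case the two weaker models outvote the anchor under a plain majority vote (option 0), whereas the damped propagation keeps the anchor's answer (option 1), because the anchor moves only slightly toward the ensemble while the weaker nodes move substantially toward it. The refinement loop touches only the stored distributions, which is the property the abstract attributes to the method.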
