With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, ignoring latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), that leverage both first-order and second-order information. Our theoretical analysis shows that, under mild assumptions, these methods provably mitigate inherent limitations of majority voting, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, on popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and on a real-world healthcare setting, ARMMAN. Our algorithms consistently outperform standard baselines, establishing a robust, training-free framework for effective multi-agent LLM aggregation.
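To make the limitation of majority voting concrete, the following is a minimal sketch, not the OW or ISP algorithm itself, whose definitions are not given in this abstract. It contrasts plain majority voting with a weighted vote that uses the classical log-odds weighting for independent voters. The function names, the example answers, and the per-model accuracies are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, counting every model equally."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, accuracies):
    """Return the answer with the largest total log-odds weight.

    Assumption (not from the paper): each model is an independent voter
    with a known accuracy strictly between 0 and 1, and its vote is
    weighted by log(acc / (1 - acc)).
    """
    scores = {}
    for answer, acc in zip(answers, accuracies):
        weight = math.log(acc / (1.0 - acc))
        scores[answer] = scores.get(answer, 0.0) + weight
    return max(scores, key=scores.get)


# Hypothetical example: one strong model disagrees with two weak ones.
answers = ["B", "A", "A"]
accuracies = [0.95, 0.60, 0.60]

print(majority_vote(answers))              # the two weak models win
print(weighted_vote(answers, accuracies))  # the strong model wins
```

In this toy case the two aggregators disagree: treating all answers equally lets the two weaker models outvote the stronger one, while accounting for heterogeneity reverses the decision. The sketch still assumes independent models, so it does not address the correlation across models that the abstract also highlights.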