With the rapid progress of multi-agent large language model (LLM) reasoning, effectively aggregating answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to account for latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms, Optimal Weight (OW) and Inverse Surprising Popularity (ISP), which leverage both first-order and second-order information. Our theoretical analysis shows that, under mild assumptions, these methods provably mitigate inherent limitations of majority voting, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting, ARMMAN. Across all cases, our methods consistently outperform majority voting, offering both practical performance gains and conceptual insights for the design of robust multi-agent LLM pipelines.
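To make the contrast concrete, the sketch below compares plain majority voting with two illustrative aggregation rules: a weighted vote that uses first-order information (per-model reliability weights) and a "surprisingly popular"-style rule that uses second-order information (each model's prediction of how popular each answer will be). This is a minimal sketch under assumed inputs; the function names, the weights, and the popularity forecasts are hypothetical stand-ins, not the paper's actual OW or ISP estimators, which are defined in the main text.

```python
# Illustrative sketch: majority voting vs. first-order (weighted) and
# second-order ("surprisingly popular"-style) aggregation. Not the paper's
# OW/ISP algorithms; weights and predictions below are hypothetical.
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent answer, treating every model equally."""
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, weights):
    """First-order aggregation: sum a reliability weight per model into each
    answer's score (a generic stand-in for learned optimal weights)."""
    scores = {}
    for ans, w in zip(answers, weights):
        scores[ans] = scores.get(ans, 0.0) + w
    return max(scores, key=scores.get)


def surprisingly_popular(answers, predicted_popularity):
    """Second-order aggregation in the spirit of the 'surprisingly popular'
    rule: each model also forecasts answer popularity; choose the answer whose
    actual frequency most exceeds its average forecast frequency."""
    n = len(answers)
    actual = {a: c / n for a, c in Counter(answers).items()}
    surprise = {}
    for a, freq in actual.items():
        forecast = sum(p.get(a, 0.0) for p in predicted_popularity) / n
        surprise[a] = freq - forecast
    return max(surprise, key=surprise.get)


if __name__ == "__main__":
    # Three models answer a multiple-choice question; two are correlated and wrong.
    answers = ["B", "B", "A"]
    weights = [0.3, 0.3, 0.9]                 # hypothetical reliability weights
    predictions = [{"A": 0.2, "B": 0.8},      # each model's popularity forecast
                   {"A": 0.2, "B": 0.8},
                   {"A": 0.3, "B": 0.7}]
    print(majority_vote(answers))                      # -> "B"
    print(weighted_vote(answers, weights))             # -> "A"
    print(surprisingly_popular(answers, predictions))  # -> "A"
```

In this toy example the correlated, unreliable models dominate the majority vote, while both the reliability-weighted vote and the prediction-based rule recover the minority answer, which is the failure mode of equal-weight voting that OW and ISP are designed to address.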