Large language models (LLMs) are often ensembled to improve reliability and robustness, but in practice their errors are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the models' correlated errors with a Gaussian copula and derive an information-theoretic error floor for ensemble performance. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We evaluate our approach on two question-answering datasets, MEDMCQA and MMLU, and one binary sentiment-classification dataset, IMDB movie reviews. Across all datasets, our method consistently outperforms strong baselines under the same query budget.
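The greedy selection procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes access to a held-out labeled set, uses a plug-in (empirical frequency) estimate of mutual information, and all function and variable names are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(labels, pred_tuples):
    # Plug-in estimate of I(Y; S) from paired samples, where S is the
    # tuple of predictions from the currently selected models.
    n = len(labels)
    joint = Counter(zip(labels, pred_tuples))
    p_y = Counter(labels)
    p_s = Counter(pred_tuples)
    return sum((c / n) * log2(c * n / (p_y[y] * p_s[s]))
               for (y, s), c in joint.items())

def greedy_mi_selection(labels, preds, budget):
    """Greedily add the model whose inclusion maximizes the estimated
    mutual information I(Y; selected predictions), up to `budget` models.

    preds: dict mapping model name -> list of predictions aligned with labels.
    """
    selected = []
    while len(selected) < budget:
        best_name, best_mi = None, -1.0
        for name in preds:
            if name in selected:
                continue
            candidate = selected + [name]
            # Joint prediction tuple per example for the candidate set.
            tuples = list(zip(*(preds[m] for m in candidate)))
            mi = mutual_information(labels, tuples)
            if mi > best_mi:
                best_name, best_mi = name, mi
        selected.append(best_name)
    return selected
```

Because the joint tuple space grows with each added model, a plug-in estimate like this needs enough held-out data relative to the budget; the paper's estimator may differ.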