Large language models (LLMs) exhibit complementary strengths across tasks, which motivates research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine candidate answers, which makes generalization to unseen data distributions challenging. Moreover, prior methods use textual responses as the communication medium and ignore the valuable information in the models' internal representations. In this work, we propose DeePEn, a training-free ensemble framework that fuses the probability distributions produced by different LLMs at each decoding step. Vocabulary discrepancies between heterogeneous LLMs cause token misalignment, so the distributions cannot be averaged directly. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space, based on relative representation theory, and performs the aggregation there. We then devise a search-based inverse transformation that maps the aggregated result back to the probability space of one of the ensembled LLMs (the main model) in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles of an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn is complementary to other ensemble methods such as voting.
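The three steps described above (mapping to a relative space, aggregating, and searching for an inverse) can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the DeePEn implementation: the anchor set (tokens shared by both vocabularies), the use of cosine similarity, equal ensemble weights, the squared-error objective, and the step-halving gradient search are all choices made here for illustration, and every function and variable name is hypothetical.

```python
import numpy as np

def relative_matrix(emb, anchor_ids):
    """Cosine similarity of every token embedding to the anchor-token embeddings."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit[anchor_ids].T  # shape: (vocab_size, num_anchors)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inverse_search(target, rel_main, p_init, steps=100, lr=1.0):
    """Search the main model's probability space for a distribution whose
    relative representation is close to `target` (assumed squared-error loss)."""
    logits = np.log(p_init + 1e-12)
    p = softmax(logits)
    loss = np.sum((p @ rel_main - target) ** 2)
    for _ in range(steps):
        residual = p @ rel_main - target
        grad_p = 2.0 * rel_main @ residual          # gradient w.r.t. p
        grad_logits = p * (grad_p - p @ grad_p)     # chain rule through softmax
        cand_logits = logits - lr * grad_logits
        cand = softmax(cand_logits)
        cand_loss = np.sum((cand @ rel_main - target) ** 2)
        if cand_loss < loss:                        # accept only improving steps
            logits, p, loss = cand_logits, cand, cand_loss
        else:                                       # otherwise shrink the step
            lr *= 0.5
    return p

# Toy setup (assumption): two models whose embeddings are different rotations
# of one shared geometry, with overlapping but different vocabularies.
rng = np.random.default_rng(0)
latent = rng.normal(size=(10, 8))                   # 10 token types, dim 8
vocab_a, vocab_b = np.arange(0, 8), np.arange(3, 10)
rot_a, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # random orthogonal maps
rot_b, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_a, emb_b = latent[vocab_a] @ rot_a, latent[vocab_b] @ rot_b

# Token types 3..7 are shared; these are their positions in each vocabulary.
anchors_a, anchors_b = np.arange(3, 8), np.arange(0, 5)
rel_a = relative_matrix(emb_a, anchors_a)           # (8, 5)
rel_b = relative_matrix(emb_b, anchors_b)           # (7, 5)

# Next-token distributions of each model at one decoding step (random here).
p_a = softmax(rng.normal(size=len(vocab_a)))
p_b = softmax(rng.normal(size=len(vocab_b)))

# 1) map to the relative space, 2) aggregate, 3) invert into model A's space.
target = 0.5 * (p_a @ rel_a) + 0.5 * (p_b @ rel_b)
p_fused = inverse_search(target, rel_a, p_a)
next_token = int(np.argmax(p_fused))                # index in model A's vocabulary
```

Because cosine similarity is unchanged by rotation, a token shared by both toy models gets the same row in `rel_a` and `rel_b`, which is what lets distributions over different vocabularies be averaged in the relative space. Real LLM embeddings are only approximately related in this way.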