Large language models (LLMs) show complementary strengths across tasks and instances, which motivates research on ensembling LLMs to push the frontier by leveraging the wisdom of the crowd. Existing work achieves this by training an extra reward model or fusion model to select or fuse the candidate answers. However, these methods place heavy demands on the generalizability of the trained models. Moreover, existing methods use textual responses as the communication medium, ignoring the rich information in the internal representations of the networks. We therefore propose DEEPEN, a training-free ensemble framework that averages the probability distributions output by different LLMs. A key challenge in this paradigm is the vocabulary discrepancy between heterogeneous LLMs, which prevents their probability distributions from being averaged directly. To address this challenge, DEEPEN maps the probability distribution of each model from its own probability space to a universal relative space, based on relative representation theory, and performs the aggregation there. The aggregated result is then mapped back to the probability space of one LLM via a search-based inverse transformation to determine the generated token. We conduct experiments on ensembles of various LLMs ranging from 6B to 70B parameters. Experimental results show that DEEPEN achieves consistent improvements across six popular benchmarks covering subject examination, reasoning, and knowledge-based QA, demonstrating the effectiveness of our approach.
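The three steps named in the abstract (map to a relative space, aggregate, map back by search) can be illustrated with a minimal Python sketch. This is not the authors' implementation; it rests on several assumptions made only for illustration: the relative representation of a token is taken to be the cosine similarity between its embedding and the embeddings of anchor tokens shared by all vocabularies, a distribution is mapped by taking the probability-weighted sum of those rows, and the inverse transformation is a plain gradient search on the logits of the main model under a squared-error loss. All function names, the step count, and the learning rate are hypothetical, and the toy embeddings are random, so the example shows only the mechanics, not any quality gain.

```python
import numpy as np

def relative_matrix(emb, anchor_ids):
    """Cosine similarity of every token embedding to the anchor embeddings, shape (V, A).
    Assumption: anchors are tokens shared by all vocabularies."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit[anchor_ids].T

def to_relative(p, rel):
    """Map a distribution over V tokens to the A-dimensional relative space."""
    return p @ rel

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inverse_search(target, rel, p_init, steps=50, lr=0.5):
    """Search for a distribution of the main model whose relative
    representation is close to `target` (squared-error loss, gradient
    descent on logits; both choices are assumptions of this sketch)."""
    logits = np.log(p_init + 1e-12)
    for _ in range(steps):
        p = softmax(logits)
        residual = p @ rel - target              # (A,)
        grad_p = 2.0 * rel @ residual            # dL/dp, shape (V,)
        grad_logits = p * (grad_p - p @ grad_p)  # chain rule through softmax
        logits = logits - lr * grad_logits
    return softmax(logits)

def ensemble_step(probs, rels, main=0, weights=None):
    """One decoding step: aggregate in relative space, then map back to
    the probability space of the model at index `main`."""
    rel_reps = [to_relative(p, r) for p, r in zip(probs, rels)]
    target = np.average(rel_reps, axis=0, weights=weights)
    p_main = inverse_search(target, rels[main], probs[main])
    return int(np.argmax(p_main)), p_main

# Toy setup: two models with different vocabulary sizes and embedding widths.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 4))    # model A: 8 tokens, 4-dim embeddings
emb_b = rng.normal(size=(10, 6))   # model B: 10 tokens, 6-dim embeddings
# The same three shared tokens sit at different indices in each vocabulary.
rel_a = relative_matrix(emb_a, [0, 1, 2])
rel_b = relative_matrix(emb_b, [3, 4, 5])
p_a = softmax(rng.normal(size=8))
p_b = softmax(rng.normal(size=10))

token_id, p_out = ensemble_step([p_a, p_b], [rel_a, rel_b], main=0)
print(token_id, p_out.round(3))
```

Because both distributions are projected onto the same set of anchors, they can be averaged even though the two vocabularies differ in size; the returned token index refers to the vocabulary of the main model only.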