Ensembling multiple models has long been an effective way to push the limits of existing performance and is widely used in classification tasks, typically by averaging the classification probability vectors from multiple classifiers to achieve better accuracy. In the thriving open-source Large Language Model (LLM) community, however, ensembling methods are rare and typically limited to ensembling the full-text outputs of LLMs, such as selecting the best output with a ranker, which underutilizes token-level probability information. In this paper, we treat the Generation of each token by LLMs as a Classification (GaC) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics, and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observe that most tokens in an answer are simple and do not affect the correctness of the final answer. We therefore also experiment with ensembling only key tokens, and the results show better performance with lower latency across benchmarks.
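The core idea of a single GaC step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes all ensembled models share the same tokenizer and vocabulary (so their probability vectors are aligned), and the function names are hypothetical. Each model's next-token logits are converted to a probability vector, the vectors are averaged as in classifier ensembling, and the next token is chosen from the averaged distribution.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gac_step(per_model_logits):
    """Treat one generation step as classification: average each
    model's next-token probability vector, then greedily pick the
    argmax of the averaged distribution (hypothetical helper)."""
    probs = np.stack([softmax(l) for l in per_model_logits])
    avg = probs.mean(axis=0)
    return int(avg.argmax()), avg

# Toy example: two "models" over a shared 4-token vocabulary.
m1 = np.array([2.0, 1.0, 0.1, 0.0])  # model 1 mildly favors token 0
m2 = np.array([0.0, 3.0, 0.1, 0.0])  # model 2 strongly favors token 1
token, avg = gac_step([m1, m2])      # the averaged vote selects token 1
```

In practice a sampling strategy could replace the greedy argmax, and the key-token variant described above would invoke such an ensembling step only at selected positions, letting a single model generate the remaining tokens.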