Hit the Sweet Spot! Span-Level Ensemble for Large Language Models

Ensembling various LLMs to unlock their complementary potential and leverage their individual strengths is highly valuable. Previous studies typically focus on two main paradigms: sample-level and token-level ensembles. Sample-level ensemble methods either select or blend fully generated outputs, which prevents dynamic correction and enhancement of outputs during generation. Token-level ensemble methods, on the other hand, enable real-time correction through fine-grained ensembling at each generation step. However, an individual token carries very limited information, leading to suboptimal decisions at each step. To address these issues, we propose SweetSpan, a span-level ensemble method that balances the need for real-time adjustment against the information required for accurate ensemble decisions. Our approach involves two key steps. First, each candidate model independently generates candidate spans based on the shared prefix. Second, we calculate perplexity scores so that the candidate models can evaluate one another's spans, and we achieve robust span selection by filtering out unfaithful scores. To comprehensively evaluate ensemble methods, we propose a new challenging setting (ensembling models with significant performance gaps) in addition to the standard setting (ensembling the best-performing models), so as to assess model ensembles in more realistic scenarios. Experimental results in both the standard and challenging settings across various language generation tasks demonstrate the effectiveness, robustness, and versatility of our approach compared with previous ensemble methods.
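
To make the two steps concrete, the following is a minimal sketch of one span-selection round, not the authors' implementation. It assumes that each model proposes a single span per round, that every model can return per-token log-probabilities for any span given the shared prefix, and that a score counts as "unfaithful" when it exceeds a multiple of the median score for that span. The names `select_span`, `LogProbFn`, and `tolerance`, as well as the median-based filter and the mean aggregation, are illustrative assumptions; the abstract does not specify the exact filtering rule.

```python
import math
from statistics import median
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer interface: per-token log-probabilities that a model
# assigns to `span` when it continues `prefix`.
LogProbFn = Callable[[str, Sequence[str]], List[float]]


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity of a span: exp of the negative mean token log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))


def select_span(
    prefix: str,
    candidates: Dict[str, List[str]],
    scorers: Dict[str, LogProbFn],
    tolerance: float = 2.0,
) -> str:
    """Return the name of the model whose span gets the lowest filtered score.

    Every model scores every candidate span (mutual evaluation). Scores above
    `tolerance` times the median are dropped before averaging; this filter is
    an assumption standing in for the paper's handling of unfaithful scores.
    """
    best_name, best_score = "", math.inf
    for name, span in candidates.items():
        scores = [perplexity(score(prefix, span)) for score in scorers.values()]
        cutoff = tolerance * median(scores)
        kept = [s for s in scores if s <= cutoff]  # never empty for tolerance >= 1
        mean_score = sum(kept) / len(kept)
        if mean_score < best_score:
            best_name, best_score = name, mean_score
    return best_name
```

In a full generation loop, the selected span would be appended to the shared prefix and the round repeated until an end-of-sequence condition is met. Because the winning span is chosen every few tokens rather than once per output or once per token, this is the granularity the abstract describes as lying between sample-level and token-level ensembling.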
