While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at \url{https://github.com/SLIT-AI/WRPO}.
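The core mechanism described above — a preference-optimization loss whose preferred example shifts from the target LLM to the source LLMs — can be sketched in miniature. This is an illustrative sketch only, not the paper's exact formulation: the function names, the linear schedule for the mixing weight `alpha`, and the DPO-style implicit reward `beta * (log pi - log pi_ref)` are assumptions made for exposition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def wrpo_style_loss(logp_src_w, ref_src_w, logp_tgt_w, ref_tgt_w,
                    logp_l, ref_l, alpha, beta=0.1):
    """DPO-style preference loss with a weighted 'chosen' reward.

    Each (logp, ref) pair holds the policy and reference-model
    log-probabilities of a response. `alpha` blends the implicit rewards
    of the source-preferred and target-preferred responses, so a schedule
    moving alpha from 0 to 1 gradually shifts reliance on preferred
    examples from the target LLM to the source LLMs.
    (Hypothetical signature for illustration, not the paper's code.)
    """
    r_src = beta * (logp_src_w - ref_src_w)  # implicit reward: source-preferred
    r_tgt = beta * (logp_tgt_w - ref_tgt_w)  # implicit reward: target-preferred
    r_l = beta * (logp_l - ref_l)            # implicit reward: dispreferred
    r_w = alpha * r_src + (1.0 - alpha) * r_tgt
    return -math.log(sigmoid(r_w - r_l))


def alpha_schedule(step, total_steps):
    """A simple linear progressive-adaptation schedule: 0 -> 1 over training."""
    return min(1.0, step / max(1, total_steps))
```

With `alpha = 0` the loss reduces to an ordinary DPO-style objective on the target model's own preferred response; as `alpha` approaches 1, the source models' responses dominate the preferred side, avoiding any explicit vocabulary alignment or distribution-matrix merging.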