AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging

Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build a range of models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy--token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to indicate, early in the search, which layers are likely to matter. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models that are still being evaluated. A lightweight reranking step further spreads candidates across the accuracy--cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.
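To make the two central notions above concrete, the following is a minimal sketch of layer-wise merging and of extracting a Pareto set over (accuracy, token cost). It is illustrative only and is not taken from the AP-BMM code base. It assumes that each layer is merged by plain linear interpolation, which the abstract does not specify, and the names `merge_layerwise` and `pareto_set`, the toy parameter lists, and the example scores are all hypothetical.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def merge_layerwise(base: Params, reasoning: Params,
                    weights: Dict[str, float]) -> Params:
    """Merge two models with one weight per layer.

    Assumption: linear interpolation per layer,
    merged_l = (1 - w_l) * base_l + w_l * reasoning_l.
    A weight of 0 keeps the base layer, 1 keeps the reasoning layer.
    """
    merged: Params = {}
    for name, base_params in base.items():
        w = weights[name]
        merged[name] = [(1 - w) * b + w * r
                        for b, r in zip(base_params, reasoning[name])]
    return merged


def pareto_set(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated (accuracy, token_cost) points.

    Higher accuracy is better and lower token cost is better. A point is
    dominated if another point is at least as good on both objectives
    and strictly better on at least one.
    """
    keep: List[int] = []
    for i, (acc_i, cost_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and cost_j <= cost_i
            and (acc_j > acc_i or cost_j < cost_i)
            for j, (acc_j, cost_j) in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep


# Toy usage: two "layers" with made-up parameters and made-up scores.
base = {"layer0": [0.0, 2.0], "layer1": [1.0]}
reasoning = {"layer0": [1.0, 4.0], "layer1": [3.0]}
merged = merge_layerwise(base, reasoning, {"layer0": 0.5, "layer1": 0.0})

scores = [(0.80, 1200.0), (0.75, 900.0), (0.70, 1000.0), (0.60, 400.0)]
front = pareto_set(scores)
```

In this toy example the third candidate is dropped because the second one is both more accurate and cheaper. The remaining candidates form the kind of set the abstract describes: each is the best available choice for some accuracy versus token-cost preference.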
