Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to the reliance on heuristic trials or expensive proxy training. To address this, we introduce \textbf{MergeMix}, a novel approach that efficiently determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. By training domain-specific experts on a minimal token budget and optimizing their merging weights against downstream benchmarks, MergeMix identifies high-performing data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters show that MergeMix matches or surpasses exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman $\rho > 0.9$) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.
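As an illustration of this workflow only (not the paper's released implementation), the sketch below shows how merging weights over per-domain expert checkpoints might be searched against a benchmark score and then read off as proxy mixing ratios; the random Dirichlet search and the helper names (\texttt{merge\_experts}, \texttt{search\_mixture}, the \texttt{evaluate} callback) are assumptions introduced here for exposition.
\begin{verbatim}
import numpy as np

def merge_experts(expert_params, alphas):
    # Linear merge: weighted average of per-domain expert parameter vectors.
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()          # keep merging weights on the simplex
    return sum(a * p for a, p in zip(alphas, expert_params))

def search_mixture(expert_params, evaluate, n_trials=200, seed=0):
    # Optimize merging weights against a downstream-benchmark score and
    # return the best weights as proxy data-mixing ratios.
    rng = np.random.default_rng(seed)
    best_alphas, best_score = None, -np.inf
    for _ in range(n_trials):
        alphas = rng.dirichlet(np.ones(len(expert_params)))
        score = evaluate(merge_experts(expert_params, alphas))
        if score > best_score:
            best_alphas, best_score = alphas, score
    return best_alphas, best_score
\end{verbatim}
Any black-box optimizer over the weight simplex could replace the random Dirichlet sampling shown here; it is included only as the simplest possible search baseline.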