The escalating scale and cost of Large Language Model (LLM) training necessitate accurate prediction of downstream task performance before training, so that scaling properties can be understood comprehensively. This prediction is challenged by two factors: 1) the emergence phenomenon, in which unpredictable capabilities appear abruptly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, which lead to high metric variability. Existing prediction methods lack both accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. COD clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling as the compute budget increases. We adopt a theoretically supported performance scaling law to predict cluster-wise performance; the performance of this predictable subset then serves as an intermediate predictor for the full evaluation set. We further derive a mapping function that accurately extrapolates subset performance to the full set. Applied to a 70B-parameter LLM, COD achieves a 1.55\% average prediction error across eight key LLM benchmarks, providing actionable insights into scaling properties and enabling training monitoring during LLM pre-training.
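To make the pipeline concrete, below is a minimal sketch of the first two COD stages described above: clustering tasks by their difficulty scaling features and fitting a per-cluster scaling law that is extrapolated to the target compute. All names, the use of k-means, and the sigmoidal form of the scaling law are illustrative assumptions, not the paper's exact implementation; the final mapping from subset to full-set performance is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans


def cluster_by_difficulty(difficulty_features, n_clusters=4):
    """Group tasks by difficulty scaling features, e.g. per-task accuracies
    measured at a ladder of small-compute checkpoints.
    difficulty_features has shape (n_tasks, n_checkpoints)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(difficulty_features)


def scaling_curve(log_compute, a, b, c):
    """Assumed sigmoidal performance scaling law in log-compute;
    accuracy saturates at ceiling c as compute grows."""
    return c / (1.0 + np.exp(-(a * log_compute + b)))


def fit_cluster_law(log_compute, cluster_accuracy):
    """Fit the assumed scaling curve to one cluster's mean accuracy
    across the small-compute checkpoints."""
    params, _ = curve_fit(scaling_curve, log_compute, cluster_accuracy,
                          p0=(1.0, 0.0, 1.0), maxfev=10000)
    return params


def predict_subset(log_compute, task_acc, labels, predictable_ids, target):
    """Extrapolate each well-behaved (predictable) cluster to the target
    compute and average into a subset-level score; a fitted mapping
    function (not shown) would carry this to the full evaluation set."""
    preds = []
    for cid in predictable_ids:
        mean_acc = task_acc[labels == cid].mean(axis=0)  # per-checkpoint mean
        params = fit_cluster_law(log_compute, mean_acc)
        preds.append(scaling_curve(target, *params))
    return float(np.mean(preds))
```

In this sketch, clusters whose fitted curves generalize well across held-out checkpoints would be kept as the predictable subset, reflecting the abstract's claim that difficulty-based clustering isolates tasks with stable scaling behavior.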