We investigate large language model performance across five orders of magnitude of compute scaling in eleven recent model architectures. We show that average benchmark performance, aggregated over many individual tasks and evaluations as in the commonly used BIG-Bench dataset, is reasonably predictable as a function of training compute scale. Specifically, when extrapolating BIG-Bench Hard performance across one order of magnitude in compute, we observe average absolute errors of 6 percentage points (pp). By contrast, extrapolation for individual BIG-Bench tasks across an order of magnitude in compute yields higher average errors of 18pp. Nonetheless, individual task performance remains significantly more predictable than chance. Overall, our work suggests that compute scaling provides a promising basis for forecasting AI capabilities across diverse benchmarks, though predicting performance on specific tasks poses challenges.
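To make the notion of extrapolation error concrete, the following is a minimal sketch of how an error in percentage points could be computed when holding out the top order of magnitude of compute. It assumes a simple sigmoid-in-log-compute fit obtained by least squares in logit space, which is an illustrative choice and not necessarily the fitting procedure used in this work. The function names (`fit_line`, `extrapolation_error_pp`) and the data points are hypothetical and synthetic; they are not results from the paper.

```python
import math


def logit(p):
    """Map an accuracy in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolation_error_pp(log_compute, accuracy, holdout_oom=1.0):
    """Mean absolute extrapolation error in percentage points.

    Fits on all points at least `holdout_oom` orders of magnitude below
    the largest compute value, then predicts the held-out points.
    Assumption: accuracy is a sigmoid of log10(compute).
    """
    cutoff = max(log_compute) - holdout_oom
    train = [(c, a) for c, a in zip(log_compute, accuracy) if c <= cutoff]
    test = [(c, a) for c, a in zip(log_compute, accuracy) if c > cutoff]
    slope, intercept = fit_line(
        [c for c, _ in train], [logit(a) for _, a in train]
    )
    errors = [abs(sigmoid(slope * c + intercept) - a) for c, a in test]
    return 100.0 * sum(errors) / len(errors)


# Synthetic illustration only: log10(FLOP) spanning five orders of magnitude,
# with accuracies generated from an exact sigmoid (hypothetical values).
log_compute = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
accuracy = [sigmoid(0.8 * (c - 23.0)) for c in log_compute]
print(f"error: {extrapolation_error_pp(log_compute, accuracy):.2f} pp")
```

Fitting in logit space keeps the sketch dependency-free; a nonlinear least-squares fit on the raw accuracies would be a reasonable alternative, and accuracies of exactly 0 or 1 would need clipping before the logit transform.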