Recent work claims that large language models display emergent abilities: abilities that are absent in smaller-scale models but present in larger-scale models. What makes emergent abilities intriguing is twofold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test, and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test, and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen, seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.
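The metric-choice argument can be illustrated with a minimal sketch. It is a toy setup, not the paper's experimental code: the power-law form, the constants `c` and `alpha`, the target length of 10 tokens, and all function names are assumptions chosen for illustration. Per-token accuracy improves smoothly with scale, while exact match over a multi-token target (a nonlinear function of the same per-token accuracy) stays near zero and then rises abruptly.

```python
import math


def per_token_accuracy(num_params, c=1e9, alpha=-0.5):
    """Probability of emitting one correct token.

    Assumption: per-token cross-entropy follows a hypothetical power law
    in parameter count, (num_params / c) ** alpha, with illustrative
    constants c and alpha.
    """
    cross_entropy = (num_params / c) ** alpha
    return math.exp(-cross_entropy)


def exact_match(num_params, target_len=10):
    """Nonlinear metric: every one of target_len tokens must be correct.

    Assumes token errors are independent, so the score is p ** target_len.
    """
    return per_token_accuracy(num_params) ** target_len


def token_accuracy(num_params):
    """Linear metric: expected fraction of correct tokens, which equals p."""
    return per_token_accuracy(num_params)


if __name__ == "__main__":
    # Same hypothetical model outputs, scored by two different metrics.
    print(f"{'params':>10}  {'token acc':>10}  {'exact match':>12}")
    for exponent in range(7, 13):
        n = 10.0 ** exponent
        print(f"{n:>10.0e}  {token_accuracy(n):>10.4f}  {exact_match(n):>12.4f}")
```

In this sketch both columns are computed from the same underlying per-token accuracy; only the scoring function differs. The exact-match column stays close to zero across several orders of magnitude before climbing, which is the kind of curve that would be read as an emergent ability.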