The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques of it. We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH simply claims that pre-trained models drastically reduce the complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate the task complexity of mathematical reasoning, machine translation, and instruction following; we then show that these complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performance on our tasks, but accessing it can require programs gigabytes in length. Post-training, on the other hand, reduces the complexity of reaching the same performance by several orders of magnitude. Overall, our results highlight that task adaptation often requires surprisingly little information: just a few kilobytes.
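As a minimal sketch of the definition in symbols (the notation $C_\tau$, $\mathrm{perf}_T$, and $M$ is assumed here, not fixed by the abstract): the task complexity of a task $T$ at target performance $\tau$ can be written Kolmogorov-style as
\[
  C_\tau(T) \;=\; \min_{p \,:\, \mathrm{perf}_T(p) \,\ge\, \tau} |p|,
\]
and the complexity conditioned on a pre-trained model $M$ as
\[
  C_\tau(T \mid M) \;=\; \min_{p \,:\, \mathrm{perf}_T(M \circ p) \,\ge\, \tau} |p|,
\]
where $p$ ranges over programs, $|p|$ is the program length in bits, and $M \circ p$ denotes the model adapted by $p$ (e.g., a prompt or a weight update). In these terms, the SAH amounts to the claim that $C_\tau(T \mid M) \ll C_\tau(T)$ for high targets $\tau$ on many tasks.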