We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. In our Latent Instance Difficulty (LID) model, each input's target variance is governed by a latent ``precision'' drawn from a heavy-tailed distribution. While the generalization loss recovers standard scaling laws, our main contribution connects training to inference: the pass@$k$ failure rate decays as a power law, $k^{-\beta_\text{eff}}$, but the observed exponent $\beta_\text{eff}$ is training-dependent. It grows with the sample size $N$ before saturating at an intrinsic limit $\beta$ set by the tail of the difficulty distribution. This coupling shows that learning shrinks the ``hard tail'' of the error distribution: improvements in the model's generalization error steepen the pass@$k$ curve until irreducible target variance dominates. The LID model yields testable, closed-form predictions for this behavior, including a compute-allocation rule that favors training before saturation and additional inference attempts after it. We validate these predictions in simulations and on two real-data proxies: CIFAR-10H (human-label variance) and a math teacher-student distillation task.
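As a concrete illustration of the claimed coupling between training and the pass@$k$ slope, the following minimal Monte Carlo sketch in Python uses our own illustrative parameterization, not the paper's exact LID specification: intrinsic per-attempt success probabilities $s_i \sim \mathrm{Beta}(\beta, 1)$ (tail exponent $\beta$ at $s \to 0$), and under-training modeled as $p_i = s_i^{\gamma_N}$ with an assumed coupling $\gamma_N = 1 + c\,N^{-\alpha}$, so that $p_i$ has tail exponent $\beta/\gamma_N$ and the fitted slope $\beta_\text{eff}$ rises with $N$ toward $\beta$. The names \texttt{beta}, \texttt{alpha}, \texttt{c}, and \texttt{gamma} are hypothetical choices for this sketch.

\begin{verbatim}
# Minimal Monte Carlo sketch of the training / pass@k coupling.
# Assumptions (ours, for illustration): s_i ~ Beta(beta, 1), whose
# density beta * s^(beta-1) has tail exponent beta as s -> 0; finite
# training data flattens the tail via p_i = s_i**gamma_N with
# gamma_N = 1 + c * N**(-alpha). Then p_i has tail exponent
# beta / gamma_N, so the fitted pass@k failure slope beta_eff
# grows with N and saturates at the intrinsic limit beta.
import numpy as np

rng = np.random.default_rng(0)
beta, alpha, c = 0.5, 0.5, 3.0  # intrinsic tail, learning exponent, coupling
M = 1_000_000                   # Monte Carlo instances
ks = np.unique(np.logspace(1, 3, 20).astype(int))  # attempt budgets k

s = rng.beta(beta, 1.0, size=M)  # intrinsic per-attempt success probs
for N in [10, 100, 1_000, 10_000, 100_000]:
    gamma = 1.0 + c * N ** (-alpha)  # under-training flattens the tail
    p = s ** gamma                   # per-attempt success after N samples
    # pass@k failure rate: mean over instances of (1 - p_i)^k
    fail = np.array([np.mean((1.0 - p) ** k) for k in ks])
    slope, _ = np.polyfit(np.log(ks), np.log(fail), 1)  # log-log fit
    print(f"N={N:>7d}  beta_eff ~ {-slope:.3f}  (limit beta={beta})")
\end{verbatim}

Under these assumptions the printed $\beta_\text{eff}$ starts well below $\beta$ at small $N$ and approaches $\beta$ as $N$ grows, reproducing qualitatively the saturation behavior described above.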