Neural scaling laws describe how model performance improves as a power law with size, but existing work focuses on models above 100M parameters. The sub-20M regime -- where TinyML and edge AI operate -- remains unexamined. We train 90 models (22K--19.8M parameters) across two architectures (plain ConvNet, MobileNetV2) on CIFAR-100, varying width while holding depth and training fixed. Both follow approximate power laws in error rate: $\alpha = 0.156 \pm 0.002$ (ScaleCNN) and $\alpha = 0.106 \pm 0.001$ (MobileNetV2) across five seeds. Since prior work fit cross-entropy loss rather than error rate, direct exponent comparison is approximate; with that caveat, these are 1.4--2$\times$ steeper than $\alpha \approx 0.076$ for large language models. The power law does not hold uniformly: local exponents decay with scale, and MobileNetV2 saturates at 19.8M parameters ($\alpha_{\mathrm{local}} = 0.006$). Error structure also changes. Jaccard overlap between error sets of the smallest and largest ScaleCNN is only 0.35 (25 seed pairs, $\pm 0.004$) -- compression changes which inputs are misclassified, not merely how many. Small models concentrate capacity on easy classes (Gini: 0.26 at 22K vs. 0.09 at 4.7M) while abandoning the hardest (bottom-5 accuracy: 10% vs. 53%). Counter to expectation, the smallest models are best calibrated (ECE = 0.013 vs. peak 0.110 at mid-size). Aggregate accuracy is therefore misleading for edge deployment; validation must happen at the target model size.
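The two headline measurements can be sketched concretely. Below is a minimal illustration, not the paper's actual pipeline: a power-law exponent $\alpha$ recovered by least squares in log-log space (on synthetic error rates, since the real data are not given here), and the Jaccard overlap between two models' sets of misclassified example indices. All function names and parameter counts in the demo are assumptions for illustration.

```python
import numpy as np

def fit_power_law(params, errors):
    """Fit error = c * params^(-alpha) by linear least squares in
    log-log space; returns (alpha, c). Illustrative, not the authors'
    exact fitting procedure."""
    slope, intercept = np.polyfit(np.log(params), np.log(errors), 1)
    return -slope, np.exp(intercept)

def jaccard_overlap(errors_a, errors_b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of
    misclassified example indices."""
    a, b = set(errors_a), set(errors_b)
    return len(a & b) / len(a | b)

# Synthetic check: an exact power law with alpha = 0.156 (the ScaleCNN
# exponent reported in the abstract) is recovered by the fit. The
# parameter counts are hypothetical grid points in the studied range.
params = np.array([22e3, 88e3, 350e3, 1.4e6, 4.7e6, 19.8e6])
errors = 5.0 * params ** -0.156
alpha, c = fit_power_law(params, errors)
```

A Jaccard overlap of 0.35, as reported between the smallest and largest ScaleCNN, means only about a third of the combined error set is shared; most of each model's mistakes are its own, which is why aggregate accuracy alone cannot certify a compressed model.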