Neural scaling laws describe how model performance improves as a power law with size, but existing work focuses on models above 100M parameters. The sub-20M regime -- where TinyML and edge AI operate -- remains unexamined. We train 90 models (22K--19.8M parameters) across two architectures (plain ConvNet, MobileNetV2) on CIFAR-100, varying width while holding depth and training fixed. Both follow approximate power laws in error rate: $\alpha = 0.156 \pm 0.002$ (ScaleCNN) and $\alpha = 0.106 \pm 0.001$ (MobileNetV2) across five seeds. Since prior work fit cross-entropy loss rather than error rate, direct exponent comparison is approximate; with that caveat, these are 1.4--2$\times$ steeper than $\alpha \approx 0.076$ for large language models. The power law does not hold uniformly: local exponents decay with scale, and MobileNetV2 saturates at 19.8M parameters ($\alpha_{\mathrm{local}} = 0.006$). Error structure also changes. Jaccard overlap between error sets of the smallest and largest ScaleCNN is only 0.35 (25 seed pairs, $\pm 0.004$) -- compression changes which inputs are misclassified, not merely how many. Small models concentrate capacity on easy classes (Gini: 0.26 at 22K vs. 0.09 at 4.7M) while abandoning the hardest (bottom-5 accuracy: 10% vs. 53%). Counter to expectation, the smallest models are best calibrated (ECE = 0.013 vs. peak 0.110 at mid-size). Aggregate accuracy is therefore misleading for edge deployment; validation must happen at the target model size.
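The two headline measurements can be sketched concretely. Below is a minimal illustration, not the paper's actual pipeline: a power-law exponent $\alpha$ recovered by least squares in log-log space (on synthetic error rates, since the real data are not given here), and the Jaccard overlap between two models' sets of misclassified example indices. All function names and parameter counts in the demo are assumptions for illustration.

```python
import numpy as np

def fit_power_law(params, errors):
    """Fit error = c * params^(-alpha) by linear least squares in
    log-log space; returns (alpha, c). Illustrative, not the authors'
    exact fitting procedure."""
    slope, intercept = np.polyfit(np.log(params), np.log(errors), 1)
    return -slope, np.exp(intercept)

def jaccard_overlap(errors_a, errors_b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of
    misclassified example indices."""
    a, b = set(errors_a), set(errors_b)
    return len(a & b) / len(a | b)

# Synthetic check: an exact power law with alpha = 0.156 (the ScaleCNN
# exponent reported in the abstract) is recovered by the fit. The
# parameter counts are hypothetical grid points in the studied range.
params = np.array([22e3, 88e3, 350e3, 1.4e6, 4.7e6, 19.8e6])
errors = 5.0 * params ** -0.156
alpha, c = fit_power_law(params, errors)
```

A Jaccard overlap of 0.35, as reported between the smallest and largest ScaleCNN, means only about a third of the combined error set is shared; most of each model's mistakes are its own, which is why aggregate accuracy alone cannot certify a compressed model.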