Neural scaling laws promise a predictable recipe for AI advancement: energy should scale linearly with numerical precision ($E \propto \mathrm{bits}$), so reducing precision should yield proportional gains in computational efficiency and energy consumption. In this paper, we demonstrate that this scaling law breaks down for multi-hop reasoning. We reveal a 'quantization trap': reducing precision from 16-bit to 8- or 4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to two mechanisms: hardware casting overhead, i.e., the hidden latency cost of dequantization kernels, which becomes the dominant bottleneck in sequential reasoning chains; and a sequential energy amortization failure. As a result, the breakdown of the scaling law is unavoidable in practice. We formalize a Critical Model Scale $N^*$ that predicts, as a function of model size, batch size, and hardware configuration, when the trap dissolves or deepens, and we validate it across a 120$\times$ range of model sizes (0.6B--72B parameters) on six GPU architectures. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
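To make the mechanism concrete, consider a minimal sketch of the energy trade-off (our illustrative notation, not the paper's formal model): assume each of $H$ hops in a reasoning chain costs a compute term proportional to bit-width $b$ and parameter count $N$ (with a hypothetical hardware constant $c$), and that quantizing to $b' < b$ bits adds a fixed dequantization overhead $E_{\mathrm{dq}}$ amortized over batch size $B$. Then
\begin{align*}
  % illustrative model; c and E_dq are hypothetical hardware constants
  E_{\mathrm{fp}} &= H\, c\, b\, N, \\
  E_{\mathrm{q}}  &= H \left( c\, b'\, N + \frac{E_{\mathrm{dq}}}{B} \right), \\
  E_{\mathrm{q}} < E_{\mathrm{fp}} &\iff N > N^{*} := \frac{E_{\mathrm{dq}}}{B\, c\, (b - b')}.
\end{align*}
Under these assumptions, quantization saves energy only above the critical scale $N^{*}$; below it, the fixed per-hop dequantization cost dominates and net energy increases, which is the trap in miniature.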