Theoretical Foundations of Scaling Law in Familial Models

Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume that a training run produces a single dense model. This assumption overlooks familial models, a paradigm aimed at ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Rather than fixing a static architecture, familial models combine early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ an IsoFLOP experimental design to isolate architectural impact from computational scale: at each fixed compute budget, we systematically sweep model size (N) and granularity (G) while adjusting the number of training tokens (D) to hold the budget constant. This design decouples the marginal cost of granularity from the benefits of scale, supporting a high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
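To make the IsoFLOP sweep and the multiplicative granularity penalty concrete, the following is a minimal sketch, not the fitted law itself. It assumes a Chinchilla-style dense loss E + A/N^alpha + B/D^beta, a penalty factor G^gamma applied to the whole loss, and the common approximation C ≈ 6ND for training compute. None of these choices is stated in the abstract. All function names and constants (including gamma) are hypothetical placeholders chosen for illustration only.

```python
# Illustrative sketch only: the functional form and every constant below are
# assumptions, not values reported by the authors.

def isoflop_tokens(compute_budget, n_params):
    """Tokens D that keep the budget fixed, assuming C ~= 6 * N * D."""
    return compute_budget / (6.0 * n_params)


def dense_loss(n, d, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Chinchilla-style dense loss with placeholder constants."""
    return E + A / n**alpha + B / d**beta


def familial_loss(n, d, g, gamma=0.01, **dense_kwargs):
    """Assumed form L(N, D, G) = dense_loss(N, D) * G**gamma.

    gamma is a hypothetical small exponent; G = 1 recovers the dense baseline.
    """
    return dense_loss(n, d, **dense_kwargs) * g**gamma


def isoflop_grid(compute_budget, model_sizes, granularities):
    """Sweep N and G at one fixed budget, setting D from the budget."""
    rows = []
    for n in model_sizes:
        d = isoflop_tokens(compute_budget, n)
        for g in granularities:
            rows.append((n, d, g, familial_loss(n, d, g)))
    return rows


if __name__ == "__main__":
    for n, d, g, loss in isoflop_grid(6e18, [1e8, 1e9], [1, 2, 4, 8]):
        print(f"N={n:.0e}  D={d:.2e}  G={g}  L={loss:.4f}")
```

Under this assumed form, a small gamma means that increasing G raises the loss at every (N, D) point by nearly the same small factor, so the compute-optimal N at a given budget is unchanged relative to the dense baseline. This is one way to read the claim that deployment flexibility does not compromise compute-optimality.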
