Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model as the output. This assumption overlooks "familial models," a paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Moving beyond static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one run, many models" paradigm by introducing granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ an IsoFLOP experimental design to isolate architectural impact from computational scale: for each fixed compute budget, we systematically sweep model sizes (N) and granularities (G) while adjusting tokens (D) to exhaust the budget. This design decouples the marginal cost of granularity from the benefits of scale, yielding a high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
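The unified form L(N, D, G) with a multiplicative granularity penalty can be sketched as follows. The abstract does not give the exact parameterization, so the sketch below assumes a Chinchilla-style dense loss multiplied by G^gamma; all coefficient values (A, B, E, alpha, beta, gamma) are illustrative placeholders, not fitted results from this work.

```python
# Hypothetical unified scaling law: a Chinchilla-style dense loss scaled by a
# multiplicative granularity penalty G**gamma. The functional form and every
# coefficient below are illustrative assumptions, not values from the paper.

def loss(N, D, G, A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28, gamma=0.01):
    """Predicted loss for model size N (params), training tokens D, granularity G."""
    dense = E + A / N**alpha + B / D**beta  # standard compute-scaling terms
    return dense * G**gamma                 # granularity enters multiplicatively

# With a very small exponent gamma, moving from G=1 (dense) to G=8 sub-models
# raises the predicted loss only marginally:
dense_loss = loss(1e9, 2e10, 1)
familial_loss = loss(1e9, 2e10, 8)
penalty = familial_loss / dense_loss  # equals 8**gamma, independent of N and D
```

Because the penalty is multiplicative, the relative cost of granularity is the same at every (N, D) point, which is what makes the "train once, deploy many" trade-off cheap regardless of scale.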
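The IsoFLOP sweep described above can be sketched as a small grid generator: fix a compute budget C, vary N and G, and set D so that training compute stays constant. The sketch assumes the common approximation C ≈ 6·N·D for transformer training FLOPs; the specific budgets, model sizes, and granularities are illustrative, not the paper's actual grid.

```python
# IsoFLOP experimental-design sketch: hold total training compute C fixed,
# sweep model size N and granularity G, and back out the token count D.
# Assumes the standard approximation C ~= 6 * N * D (an assumption here).

def isoflop_grid(C, model_sizes, granularities):
    """Enumerate (N, D, G) runs that all consume the same compute budget C."""
    runs = []
    for N in model_sizes:
        D = C / (6 * N)  # tokens that exhaust the budget at this model size
        for G in granularities:
            runs.append({"N": N, "D": D, "G": G})
    return runs

# Example: a 1e20-FLOP budget swept over three sizes and four granularities.
grid = isoflop_grid(C=1e20, model_sizes=[1e8, 3e8, 1e9], granularities=[1, 2, 4, 8])
```

Every run in the grid costs the same compute, so any loss difference across G at a given N is attributable to granularity rather than scale, which is exactly the isolation the IsoFLOP design provides.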