Transformer scaling-law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture that replaces the discrete embedding lookup tables of canonical models with a continuous embedding generator. Evaluated on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard LLaMA-style architecture. An empirical power-law fit shows that Leviathan has a markedly higher effective parameter capacity: across the regime studied, it behaves like a dense model with $1.47\times$ to $2.11\times$ more parameters.
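As a point of reference, an effective parameter capacity of this kind can be read off a fit of the following general form. This is a sketch of the usual saturating power law, not necessarily the exact parameterization used in this work: $L_{\text{base}}(N)$ denotes baseline validation loss at parameter count $N$, and $L_{\infty}$, $N_c$, $\alpha$ are fitted constants.

$$
L_{\text{base}}(N) \approx L_{\infty} + \left(\frac{N_c}{N}\right)^{\alpha},
\qquad
N_{\text{eff}} = N_c\,\bigl(L_{\text{Lev}}(N) - L_{\infty}\bigr)^{-1/\alpha},
\qquad
\text{multiplier} = \frac{N_{\text{eff}}}{N}.
$$

Under this reading, $N_{\text{eff}}$ is the baseline size whose fitted loss matches Leviathan's observed loss $L_{\text{Lev}}(N)$, and a reported range of $1.47\times$ to $2.11\times$ corresponds to the interval spanned by $N_{\text{eff}}/N$ over the model sizes studied.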