A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning

We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically. Using this ansatz we derive explicit asymptotic predictions for distributional statistics of the sequences produced by a trained model, instantiated in two settings. For the \emph{Ising broadcast process} (a soft-constrained language), we prove that the variance of the generated sum scales log-linearly in the context depth and its kurtosis converges to that of a Gaussian -- both deviating from the true language for any sublinear context. For the \emph{coloring broadcast process} (a hard-constrained language) in the freezing regime, bounded-context autoregression produces sequences that, with high probability, are inconsistent with \emph{any} valid coloring of the underlying tree. Together these results imply an $\Omega(n)$ lower bound on the context length required to faithfully sample length-$n$ sequences. In contrast, we prove that an autoregressive \emph{reasoning} model with only $\Theta(\log n)$ working memory can sample exactly from the true language -- an exponential improvement. We confirm both the lower-bound predictions and the reasoning-based upper bound empirically with transformers trained on the synthetic language; the trained models track our asymptotic predictions quantitatively across a wide range of context sizes.
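As a rough illustration of the kind of data the abstract describes, the following is a minimal sketch of an Ising broadcast process, not the paper's actual construction. It assumes a complete binary tree, a uniform root spin in {-1, +1}, a single flip probability per edge, and that the leaves read left to right form the generated sequence; the function name `ising_broadcast_leaves` and the parameters `depth` and `flip_prob` are hypothetical.

```python
import random


def ising_broadcast_leaves(depth, flip_prob, rng):
    """Sample the leaf spins of a complete binary tree under an Ising broadcast.

    Assumptions (not taken from the paper): branching factor 2, uniform root
    spin, and every edge flips the parent's spin independently with
    probability flip_prob.
    """
    level = [rng.choice((-1, 1))]  # root spin, uniform on {-1, +1}
    for _ in range(depth):
        next_level = []
        for spin in level:
            for _ in range(2):  # two children per node (assumed)
                # Each child copies its parent, or flips with prob. flip_prob.
                child = -spin if rng.random() < flip_prob else spin
                next_level.append(child)
        level = next_level
    return level  # leaves, read left to right, form the sequence


if __name__ == "__main__":
    rng = random.Random(0)
    leaves = ising_broadcast_leaves(depth=4, flip_prob=0.1, rng=rng)
    # The "generated sum" statistic is the sum of the leaf spins.
    print(len(leaves), sum(leaves))
```

Under these assumptions a tree of depth $d$ yields a sequence of length $n = 2^d$, and nearby leaves are more strongly correlated than distant ones because they share a more recent common ancestor.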
