While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse, a constraint that scaling representation width cannot bypass because of irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk increases monotonically with the Wasserstein domain shift.
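For reference, the duality step mentioned in the abstract is presumably the standard Kantorovich-Rubinstein form. The following is a sketch under assumed notation, not the paper's own statement: $P$ and $Q$ stand for the in-distribution and shifted distributions on a metric space $(\mathcal{X}, d)$, and $f$ is a risk functional assumed to be $L$-Lipschitz with respect to $d$. All of these symbols are illustrative.

$$
W_1(P, Q) \;=\; \sup_{\|g\|_{\mathrm{Lip}} \le 1} \Big( \mathbb{E}_{x \sim P}[g(x)] - \mathbb{E}_{x \sim Q}[g(x)] \Big)
\quad\Longrightarrow\quad
\big| \mathbb{E}_{x \sim Q}[f(x)] - \mathbb{E}_{x \sim P}[f(x)] \big| \;\le\; L \, W_1(P, Q).
$$

The implication follows by applying the supremum to $g = \pm f / L$, which is 1-Lipschitz. Under this reading, a bounded Lipschitz constant gives an OOD risk gap that grows at most linearly in the Wasserstein shift, while an architecture whose effective $L$ is not controlled gets no useful guarantee from this inequality.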