The Bayesian Geometry of Transformer Attention

Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$ to $10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
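To make the notion of a closed-form posterior concrete for the HMM state-tracking task, the following is a minimal sketch of the standard forward (filtering) recursion, which yields the exact posterior over hidden states after each observation. It is an illustration under stated assumptions, not the setup used in the experiments: the function names `hmm_filter` and `kl_bits`, the two-state parameters, and the choice of KL divergence in bits as the error measure are all hypothetical and introduced here only for exposition.

```python
import math


def hmm_filter(init, trans, emit, observations):
    """Exact filtering posteriors p(z_t | x_1..x_t) via the forward recursion.

    Assumed conventions (illustrative only):
      init[i]     = p(z_1 = i)
      trans[i][j] = p(z_{t+1} = j | z_t = i)
      emit[j][x]  = p(x_t = x | z_t = j)
    """
    n = len(init)
    belief = list(init)
    posteriors = []
    for t, x in enumerate(observations):
        if t > 0:
            # Predict: propagate the previous belief through the transition matrix.
            belief = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
        # Update: weight by the emission likelihood, then renormalize.
        unnorm = [belief[j] * emit[j][x] for j in range(n)]
        total = sum(unnorm)
        belief = [u / total for u in unnorm]
        posteriors.append(belief)
    return posteriors


def kl_bits(p, q, eps=1e-12):
    """KL divergence D(p || q) in bits; q is clipped at eps to avoid log(0)."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical two-state HMM with binary observations.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
posteriors = hmm_filter(init, trans, emit, [0, 0, 1])
```

A model's predicted state distribution at each position could then be scored against the corresponding entry of `posteriors` with `kl_bits`, giving a per-position error in bits of the kind the accuracy figures above refer to, assuming a KL-style measure.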
