Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that improve as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.