Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, weight sharing does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, heightening optimization sensitivity and making post-training interventions such as editing, patching, and lightweight adaptation less predictable. We propose Pseudo-Inverse Tying (PIT), which synchronizes the embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by thin polar decomposition for teacher initialization or by random orthonormal initialization from scratch, and introduces a fully learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and any vocabulary-sized auxiliary parameters. We evaluate PIT on on-device models spanning 256M–1.3B parameters across pretraining and adaptation, and consistently observe improved training stability, stronger layerwise semantic consistency, and substantially reduced side effects.
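The coupled interface described above can be illustrated with a minimal numpy sketch. All shapes, names, and the tiny vocabulary/hidden sizes here are hypothetical, chosen only to make the pseudo-inverse consistency checkable; the actual PIT parameterization and training details are not specified beyond what the abstract states. With an orthonormal memory M and an SPD transform S = L Lᵀ, the unembedding M S and the embedding M S⁻¹ are pseudo-inverses of each other by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 32, 8  # hypothetical tiny vocabulary and hidden sizes

# Orthonormal shared token memory M (V x d): M^T M = I_d.
M, _ = np.linalg.qr(rng.standard_normal((V, d)))

# SPD hidden-space transform S = L L^T, parameterized by a
# lower-triangular Cholesky factor L with positive diagonal.
A = 0.1 * rng.standard_normal((d, d))
L = np.tril(A, k=-1) + np.diag(np.exp(np.diag(A)))
S = L @ L.T

# Embedding side: apply S^{-1} to token vectors via two triangular
# systems (in practice solved with stable triangular solves, e.g.
# scipy.linalg.solve_triangular; np.linalg.solve stands in here),
# so S is never explicitly inverted or pseudo-inverted.
Y = np.linalg.solve(L, M.T)        # L y   = m_v
E = np.linalg.solve(L.T, Y).T      # L^T x = y  -> rows of E are S^{-1} m_v

# Output head: transform hidden states with S, then project onto
# the vocabulary; logits for hidden state h are U @ h.
U = M @ S

# Pseudo-inverse consistency of the token interface: because
# E^T E = S^{-2}, the Moore-Penrose pseudo-inverse of E equals U^T.
print(np.allclose(np.linalg.pinv(E), U.T))  # True
```

Note that both directions share only the d×d factor L and the memory M, so no vocabulary-sized auxiliary parameters are introduced, consistent with the abstract's claim.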