Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
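To make the collision-test and inversion ideas concrete, the following is a minimal sketch on a toy model, not the SipIt algorithm itself. It assumes a hypothetical one-layer causal attention block with random weights (the names `hidden_states`, `invert`, and `min_pairwise_distance`, and all sizes, are illustrative choices, not taken from the paper). Because the hidden state at position t in a causal model depends only on tokens up to t, the input can be recovered left to right by trying each vocabulary entry and keeping the one whose hidden state matches. This brute-force search uses at most one vocabulary scan per position.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, MAX_LEN = 50, 16, 32  # illustrative sizes (assumptions)

# Randomly initialised parameters of a hypothetical one-layer causal model.
E = rng.normal(size=(VOCAB, D_MODEL))    # token embeddings
P = rng.normal(size=(MAX_LEN, D_MODEL))  # positional embeddings
Wq, Wk, Wv = (rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
              for _ in range(3))


def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def hidden_states(tokens):
    """Return one hidden vector per position (causal self-attention + norm)."""
    x = E[np.asarray(tokens)] + P[: len(tokens)]
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(D_MODEL)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # position t cannot see t+1, ...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return layer_norm(x + weights @ (x @ Wv))


def invert(target, tol=1e-8):
    """Recover the token sequence from its hidden states, one position at a time."""
    prefix = []
    for t in range(len(target)):
        # Try every vocabulary entry as the next token, given the recovered prefix.
        dists = [np.linalg.norm(hidden_states(prefix + [v])[-1] - target[t])
                 for v in range(VOCAB)]
        best = int(np.argmin(dists))
        if dists[best] > tol:
            raise ValueError(f"no token matches the hidden state at position {t}")
        prefix.append(best)
    return prefix


def min_pairwise_distance(seqs):
    """Smallest L2 distance between last-position states of distinct sequences."""
    last = np.stack([hidden_states(s)[-1] for s in seqs])
    d = np.linalg.norm(last[:, None, :] - last[None, :, :], axis=-1)
    d[np.diag_indices(len(seqs))] = np.inf  # ignore self-distances
    return d.min()


secret = rng.integers(0, VOCAB, size=8).tolist()
recovered = invert(hidden_states(secret))
print("exact recovery:", recovered == secret)
```

A collision in this setting would appear as a minimum pairwise distance of (numerically) zero between two distinct inputs. The tolerance `tol` only absorbs floating-point differences between computing a prefix and computing the full sequence.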