Autoregressive large language models (LLMs) generate text token by token, requiring n forward passes to produce a sequence of length n. Recent work, "Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction" (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We run a series of experiments that disentangle semantic and syntactic content in the two proto-tokens, analyze the stability properties of the e-token, and visualize attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes that impose semantic structure on the e-token using teacher embeddings: an anchor-based loss and a relational distillation objective. Our results indicate that the m-token captures semantic information more strongly than the e-token under standard optimization; that anchor-based constraints trade off sharply against reconstruction accuracy; and that relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.
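The two regularizers named above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the cosine-similarity relation matrices, and the mean-squared-error formulations are all illustrative choices.

```python
import numpy as np

def cosine_sim_matrix(x):
    # Row-normalize, then take pairwise cosine similarities (batch x batch).
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def anchor_loss(student, teacher):
    # Anchor-based constraint: pull each proto-token embedding directly
    # toward its teacher embedding. Assumes both live in a common
    # dimension (e.g. after a learned projection).
    return np.mean((student - teacher) ** 2)

def relational_distillation_loss(student, teacher):
    # Relational distillation: match only the batch-level similarity
    # structure. Student and teacher dimensions may differ, and the
    # student keeps freedom in its absolute positions.
    s_rel = cosine_sim_matrix(student)
    t_rel = cosine_sim_matrix(teacher)
    return np.mean((s_rel - t_rel) ** 2)
```

The contrast motivates the trade-off reported above: the anchor loss pins each proto-token to an absolute target, competing directly with whatever configuration reconstruction needs, while the relational loss constrains only the relative geometry within a batch and is invariant to scaling, leaving the proto-tokens free to sit wherever reconstruction requires.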