Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While ``Decomposition-and-Fill'' methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from \textit{coherence drift} due to the lack of cross-stream communication. In this work, we introduce the \textbf{Parallel Decoder Transformer (PDT)}, a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight \textit{Speculative Note Conditioning (SNC)} adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a \textit{speculative consensus} problem, where sibling streams broadcast semantic ``notes'' to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching \textbf{77.8\% precision} in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
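To make the SNC mechanism concrete, the sketch below illustrates one way the note/bus/gate interaction could be realized as a lightweight adapter wrapped around a frozen trunk. It is a minimal sketch under stated assumptions: the module names, note dimension, and mean-pooling choices are illustrative, not the exact parameterization used in PDT.

\begin{verbatim}
import torch
import torch.nn as nn

class SpeculativeNoteAdapter(nn.Module):
    # Illustrative SNC-style adapter (assumed structure, not the paper's exact
    # design): each parallel stream writes a low-rank "note", the notes are
    # pooled into a shared bus, and each stream reads the bus back through a
    # learned verification gate as a residual update to the frozen trunk state.
    def __init__(self, d_model: int, d_note: int = 64):
        super().__init__()
        self.write = nn.Linear(d_model, d_note)      # stream state -> note
        self.read = nn.Linear(d_note, d_model)       # bus -> stream update
        self.gate = nn.Linear(d_model + d_note, 1)   # verification head (scalar gate)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_streams, seq_len, d_model) hidden states from the frozen trunk
        notes = self.write(h).mean(dim=1)             # (num_streams, d_note)
        bus = notes.mean(dim=0, keepdim=True)         # shared latent bus (1, d_note)
        bus = bus.expand(h.size(0), -1).unsqueeze(1)  # broadcast to every stream
        bus = bus.expand(-1, h.size(1), -1)           # ... and every position
        g = torch.sigmoid(self.gate(torch.cat([h, bus], dim=-1)))  # accept/reject
        return h + g * self.read(bus)                 # gated residual injection

# Toy usage: 4 parallel streams, 16 tokens each, hidden size 512.
adapter = SpeculativeNoteAdapter(d_model=512)
h = torch.randn(4, 16, 512)
print(adapter(h).shape)  # torch.Size([4, 16, 512])
\end{verbatim}

The gate plays the role of the verification head described above: a value near zero lets a stream ignore the bus and decode independently, while a value near one injects the shared notes, approximating the cross-stream communication that serial decoding provides for free.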