Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While ``Decomposition-and-Fill'' methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from \textit{coherence drift} due to the lack of cross-stream communication. In this work, we introduce the \textbf{Parallel Decoder Transformer (PDT)}, a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight \textit{Speculative Note Conditioning (SNC)} adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a \textit{speculative consensus} problem, where sibling streams broadcast semantic ``notes'' to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching \textbf{77.8\% precision} in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
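To make the SNC mechanism concrete, the sketch below illustrates one way the note/bus/gate interaction could be realized as a lightweight adapter wrapped around a frozen trunk. It is a minimal sketch under stated assumptions: the module names, note dimension, and mean-pooling choices are illustrative, not the exact parameterization used in PDT.

\begin{verbatim}
import torch
import torch.nn as nn

class SpeculativeNoteAdapter(nn.Module):
    # Illustrative SNC-style adapter (assumed structure, not the paper's exact
    # design): each parallel stream writes a low-rank "note", the notes are
    # pooled into a shared bus, and each stream reads the bus back through a
    # learned verification gate as a residual update to the frozen trunk state.
    def __init__(self, d_model: int, d_note: int = 64):
        super().__init__()
        self.write = nn.Linear(d_model, d_note)      # stream state -> note
        self.read = nn.Linear(d_note, d_model)       # bus -> stream update
        self.gate = nn.Linear(d_model + d_note, 1)   # verification head (scalar gate)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_streams, seq_len, d_model) hidden states from the frozen trunk
        notes = self.write(h).mean(dim=1)             # (num_streams, d_note)
        bus = notes.mean(dim=0, keepdim=True)         # shared latent bus (1, d_note)
        bus = bus.expand(h.size(0), -1).unsqueeze(1)  # broadcast to every stream
        bus = bus.expand(-1, h.size(1), -1)           # ... and every position
        g = torch.sigmoid(self.gate(torch.cat([h, bus], dim=-1)))  # accept/reject
        return h + g * self.read(bus)                 # gated residual injection

# Toy usage: 4 parallel streams, 16 tokens each, hidden size 512.
adapter = SpeculativeNoteAdapter(d_model=512)
h = torch.randn(4, 16, 512)
print(adapter(h).shape)  # torch.Size([4, 16, 512])
\end{verbatim}

The gate plays the role of the verification head described above: a value near zero lets a stream ignore the bus and decode independently, while a value near one injects the shared notes, approximating the cross-stream communication that serial decoding provides for free.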