Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving, the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping: one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers an approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
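To make the notion of a push-forward mapping concrete, the following is a minimal sketch, not the K-Forcing architecture or training procedure. It assumes a toy three-token vocabulary and a hypothetical teacher (`toy_teacher_probs`), and uses standard inverse-CDF sampling to show how k independent uniform noise values deterministically fix a joint sample of k tokens. Here the mapping is still evaluated sequentially, one teacher call per token; the abstract describes a student that produces such a joint sample in a single forward pass. All function names are illustrative assumptions.

```python
import bisect
import itertools
import random

VOCAB = ["a", "b", "c"]  # toy vocabulary (assumption for illustration)


def toy_teacher_probs(prefix):
    """Hypothetical AR teacher: next-token distribution given a prefix.

    Toy rule: repeat the previous token with probability 0.5 and split
    the remaining mass evenly over the other tokens.
    """
    if not prefix:
        return [1.0 / len(VOCAB)] * len(VOCAB)
    last = prefix[-1]
    other = 0.5 / (len(VOCAB) - 1)
    return [0.5 if tok == last else other for tok in VOCAB]


def inverse_cdf(probs, u):
    """Map u in [0, 1) to a token index via the inverse CDF."""
    cdf = list(itertools.accumulate(probs))
    # min() guards against floating-point round-off in the last CDF entry.
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)


def push_forward(prefix, noise):
    """Deterministically map k uniform noise values to k future tokens.

    Given the prefix, the output is a function of the noise alone; with
    uniform noise, the output follows the teacher's joint distribution.
    """
    out = list(prefix)
    for u in noise:
        idx = inverse_cdf(toy_teacher_probs(out), u)
        out.append(VOCAB[idx])
    return out[len(prefix):]


def sample_next_k(prefix, k, rng):
    """Draw k independent uniforms and push them forward to k tokens."""
    noise = [rng.random() for _ in range(k)]
    return push_forward(prefix, noise)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_next_k(["a"], k=4, rng=rng))
```

Because all randomness lives in the noise vector, the same noise with the same prefix always yields the same k tokens, which is the property that lets a single network, in principle, be trained to reproduce the mapping in one pass.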