Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly regenerating the target output as the source input grows, but it suffers from substantial redundant computation. We propose Self-Speculative Biased Decoding (SSBD), a simple and tuning-free inference method that accelerates re-translation by exploiting temporal coherence in streaming translation. SSBD reuses the model's previous output as a speculative draft for the updated input, verifies the draft efficiently in a single forward pass with a lightweight bias, and resumes autoregressive decoding only from the first divergence. We further introduce a display-only masking strategy that hides unstable suffixes from the user interface while retaining them in the draft for verification and potential acceptance. Experiments show that SSBD achieves substantial speedup over standard re-translation while maintaining comparable translation quality, without architectural changes, auxiliary models, or extra fine-tuning.
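The verify-then-resume loop described above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the `next_scores` scoring function, the tiny vocabulary, and the additive form of the acceptance bias are all assumptions for demonstration. In a real LLM, verifying every draft token would be done in a single batched forward pass rather than a Python loop.

```python
def ssbd_decode(next_scores, draft, bias=0.5, max_len=10):
    """Toy sketch of self-speculative biased decoding.

    Verify each token of `draft` (the previous re-translation output)
    against the model's prediction; a draft token is accepted if its score
    is within `bias` of the best score, which keeps stable prefixes intact.
    At the first divergence, fall back to ordinary greedy decoding.
    """
    out = []
    # Verification pass. In a real LLM this is one batched forward pass
    # over the whole draft; the loop here is only for clarity.
    for tok in draft:
        scores = next_scores(out)
        best = max(scores, key=scores.get)
        if scores.get(tok, float("-inf")) >= scores[best] - bias:
            out.append(tok)  # accepted: draft token survives verification
        else:
            break  # first divergence: resume autoregressive decoding here
    # Resume greedy autoregressive decoding from the first divergence.
    while len(out) < max_len:
        scores = next_scores(out)
        tok = max(scores, key=scores.get)
        out.append(tok)
        if tok == "<eos>":
            break
    return out


def next_scores(prefix):
    """Hypothetical deterministic stand-in for an LLM decoding step."""
    target = ["the", "blue", "house", "<eos>"]
    n = len(prefix)
    if n >= len(target):
        return {"<eos>": 1.0}
    scores = {t: 0.0 for t in ["the", "blue", "house", "big", "<eos>"]}
    scores[target[n]] = 1.0
    if n == 1:
        scores["big"] = 0.7  # near-tie: the bias lets a drafted "big" survive
    return scores
```

With a nonzero bias, a previously emitted near-tie token is kept (avoiding display flicker), while with `bias=0.0` verification reduces to strict greedy agreement and decoding restarts at the first mismatch:

```python
ssbd_decode(next_scores, ["the", "big", "house"], bias=0.5)
# draft fully accepted, then "<eos>" appended
ssbd_decode(next_scores, ["the", "big", "house"], bias=0.0)
# diverges at "big", regenerates "blue house <eos>"
```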