Large language models (LLMs) commonly employ autoregressive generation during inference, which leads to high memory bandwidth demand and, consequently, extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), a method that accelerates LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning, which equips them with the capability of semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. With the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
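To illustrate why draft verification can be lossless under greedy sampling, the following is a minimal sketch of a generic greedy verification step, not the BiTA implementation itself. The names `verify_draft`, `toy_greedy_next`, and `autoregressive` are hypothetical, the "model" is a toy deterministic function standing in for an LLM's greedy next-token choice, and a single linear draft is used in place of tree-based decoding. A real system would obtain all expected tokens in one parallel forward pass; the loop here is sequential only for readability.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an LLM: maps a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_draft(prefix: List[int], draft: List[int], greedy_next: NextToken) -> List[int]:
    """Accept the longest draft prefix that matches the greedy choices.

    On the first mismatch, the greedy token replaces the rejected draft token.
    If every draft token is accepted, one extra greedy token is appended.
    Each call therefore yields at least one token.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = greedy_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(greedy_next(prefix + accepted))  # bonus token after a full match
    return accepted


def toy_greedy_next(context: Sequence[int]) -> int:
    """Toy assumption: the next token is the last token plus one, modulo 10."""
    return (context[-1] + 1) % 10


def autoregressive(prefix: List[int], n: int, greedy_next: NextToken) -> List[int]:
    """Plain token-by-token greedy decoding, used as the reference output."""
    out = list(prefix)
    for _ in range(n):
        out.append(greedy_next(out))
    return out[len(prefix):]


if __name__ == "__main__":
    # The third draft token (9) is wrong and is replaced by the greedy token.
    print(verify_draft([1], [2, 3, 9], toy_greedy_next))
    print(autoregressive([1], 3, toy_greedy_next))
```

Because every accepted token equals the token that greedy decoding would have produced at that position, the verified output matches the autoregressive reference regardless of draft quality; the draft only affects how many tokens are gained per step.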