Large language models (LLMs) commonly employ autoregressive generation during inference, which leads to high memory bandwidth demand and, consequently, extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), a method that accelerates LLMs through streamlined semi-autoregressive generation and draft verification. Inspired by prompt tuning, we equip LLMs with semi-autoregressive generation capability through a parameter-efficient design called bi-directional tuning. Using efficient tree-based decoding, the model generates and verifies draft candidates in parallel, which guarantees outputs identical to those of its autoregressive counterpart under greedy sampling. BiTA serves as a lightweight plug-in module that boosts the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. With BiTA applied, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm that our method surpasses state-of-the-art acceleration techniques.
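The lossless property under greedy sampling can be illustrated with a minimal sketch of draft verification. This is not BiTA's implementation: it uses a linear draft rather than tree-based decoding, it checks draft positions in a loop rather than in one parallel forward pass, and the names `toy_next_token`, `toy_drafter`, `verify_draft`, `greedy_decode`, and `draft_and_verify_decode` are all hypothetical stand-ins introduced here for illustration. The point it shows is only that accepting the longest draft prefix matching the model's own greedy choices reproduces plain autoregressive output.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def toy_next_token(prefix: Sequence[Token]) -> Token:
    """Hypothetical stand-in for a model's greedy (argmax) next-token choice."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def toy_drafter(prefix: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical drafter: proposes k tokens, deliberately wrong at some positions."""
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = toy_next_token(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 50  # injected error so verification has something to reject
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify_draft(prefix: Sequence[Token], draft: Sequence[Token],
                 next_token: NextTokenFn) -> List[Token]:
    """Keep the longest draft prefix that matches the greedy choices, then append
    one token chosen by the model itself (a correction or a bonus token)."""
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        target = next_token(ctx)
        if tok != target:
            accepted.append(target)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token(ctx))  # whole draft accepted: one extra token
    return accepted


def greedy_decode(prompt: Sequence[Token], n: int, next_token: NextTokenFn) -> List[Token]:
    """Plain autoregressive greedy decoding: one token per step."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def draft_and_verify_decode(prompt: Sequence[Token], n: int, next_token: NextTokenFn,
                            k: int = 4) -> Tuple[List[Token], int]:
    """Draft-then-verify decoding; returns the n generated tokens and the step count."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n:
        out.extend(verify_draft(out, toy_drafter(out, k), next_token))
        steps += 1
    return out[len(prompt):len(prompt) + n], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy_decode(prompt, 20, toy_next_token)
    tokens, steps = draft_and_verify_decode(prompt, 20, toy_next_token)
    print(tokens == reference, steps)
```

Every verification step emits at least one token that the model itself would have chosen, so the result cannot diverge from greedy decoding; any speedup comes from steps in which several draft tokens are accepted at once.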