To reduce the high inference latency of autoregressive language models, previous studies have proposed early-exiting frameworks that allocate an adaptive computation path to each token based on the complexity of generating the subsequent token. However, we observe several shortcomings in these frameworks: performance degradation caused by the state copying mechanism or by numerous exit paths, and sensitivity to exit confidence thresholds. We therefore propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding of the current token with that of previously stacked early-exited tokens. Furthermore, because parallel decoding lets us observe predictions from both the shallow and deep models, we present a novel adaptive threshold estimator that uses a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrate the superiority of the proposed framework on extensive generation tasks.
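The interaction between the shallow-deep module and synchronized parallel decoding can be illustrated with a small sketch. It is not the FREE implementation: the function names (`shallow_confidence`, `plan_decoding`), the use of the maximum softmax probability as the confidence score, and the fixed threshold are all assumptions made for illustration. The sketch only plans which steps exit early and which earlier positions would be handed to the deep model together with the current token; it does not run any model, and it does not model the Beta-mixture threshold estimator.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shallow_confidence(logits):
    """Return (argmax token id, its softmax probability).

    Assumption: confidence is the top softmax probability of the shallow model.
    """
    probs = softmax(logits)
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, probs[token]


def plan_decoding(shallow_logits_per_step, threshold):
    """Plan exits for a sequence of shallow-model logits.

    Each entry of the returned plan is one of:
      ("shallow", token, [])       the token exits early and is stacked
      ("deep", None, positions)    the deep model is needed; `positions`
                                   lists the stacked early-exited steps plus
                                   the current step, to be processed together
    The deep token is None because no deep model is simulated here.
    """
    pending = []  # early-exited steps not yet seen by the deep model
    plan = []
    for step, logits in enumerate(shallow_logits_per_step):
        token, conf = shallow_confidence(logits)
        if conf >= threshold:
            pending.append(step)
            plan.append(("shallow", token, []))
        else:
            plan.append(("deep", None, pending + [step]))
            pending = []  # stacked tokens are now synchronized
    return plan
```

For example, with shallow logits `[[4.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 5.0, 0.0]]` and a hypothetical threshold of 0.9, the first and third steps exit early, while the second step falls back to the deep model and carries position 0 along with it. In the framework described above, the threshold would instead come from the adaptive estimator rather than being fixed by hand.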