Generative tasks such as text generation and question answering are central to mobile applications. Because these tasks are privacy-sensitive, there is growing demand to run them directly on mobile devices. Today they depend heavily on Large Language Models (LLMs), yet the limited memory capacity of mobile devices poses a serious challenge to scaling such models. We introduce LLMCad, an on-device inference engine designed for efficient generative Natural Language Processing (NLP) tasks. The core idea of LLMCad is model collaboration: a compact LLM that resides in memory generates the easiest tokens, while a high-precision LLM validates those tokens and corrects any errors it finds. LLMCad incorporates three novel techniques: (1) Instead of generating candidate tokens sequentially, LLMCad uses the smaller LLM to construct a token tree that covers a wider range of plausible token paths, all of which the larger LLM can then validate at once. (2) It uses a self-adjusting fallback strategy that promptly triggers verification whenever the smaller LLM generates an erroneous token. (3) To keep token generation flowing, LLMCad speculatively generates tokens during verification by means of a compute-IO pipeline. In extensive experiments, LLMCad generates tokens up to 9.3x faster than existing inference engines.
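The collaboration described in the abstract can be illustrated with a minimal Python sketch. This is not LLMCad's implementation: the function names (`build_token_tree`, `verify`, `generate`), the parameters (`width`, `max_depth`, `min_conf`), and the two toy models are all hypothetical, and models are represented simply as functions from a token prefix to next-token probabilities. A cumulative-confidence threshold stands in for the fallback trigger, and verification calls the larger model once per position, whereas a real engine would check all paths in a single batched pass. The compute-IO pipeline is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]
# Assumption: a "model" is any function mapping a prefix to next-token probabilities.
Model = Callable[[Prefix], Dict[int, float]]


def build_token_tree(draft: Model, prefix: Prefix, width: int,
                     max_depth: int, min_conf: float) -> List[Prefix]:
    """Grow a tree of draft continuations and return its root-to-leaf paths.

    A branch stops once it reaches `max_depth` or its cumulative draft
    probability falls below `min_conf` (a stand-in for a fallback trigger).
    """
    paths: List[Prefix] = []
    frontier: List[Tuple[Prefix, float]] = [((), 1.0)]
    while frontier:
        path, conf = frontier.pop()
        if len(path) == max_depth or (path and conf < min_conf):
            paths.append(path)
            continue
        probs = draft(prefix + path)
        top = sorted(probs.items(), key=lambda kv: -kv[1])[:width]
        if not top:
            paths.append(path)
            continue
        for tok, p in top:
            frontier.append((path + (tok,), conf * p))
    return paths


def greedy(model: Model, prefix: Prefix) -> int:
    """Return the most probable next token under `model`."""
    probs = model(prefix)
    return max(probs, key=probs.get)


def verify(target: Model, prefix: Prefix, paths: List[Prefix]) -> Prefix:
    """Keep the longest path prefix the target agrees with, then append
    one token from the target (a correction or a continuation)."""
    best: Prefix = ()
    for path in paths:
        accepted = 0
        for i, tok in enumerate(path):
            if greedy(target, prefix + path[:i]) != tok:
                break
            accepted += 1
        if accepted > len(best):
            best = path[:accepted]
    return best + (greedy(target, prefix + best),)


def generate(draft: Model, target: Model, prompt: List[int], n_tokens: int,
             width: int = 2, max_depth: int = 4,
             min_conf: float = 0.3) -> List[int]:
    """Alternate between drafting a token tree and verifying it."""
    out: Prefix = tuple(prompt)
    while len(out) - len(prompt) < n_tokens:
        paths = build_token_tree(draft, out, width, max_depth, min_conf)
        out += verify(target, out, paths)
    return list(out[len(prompt):len(prompt) + n_tokens])


def toy_target(prefix: Prefix) -> Dict[int, float]:
    """Toy 'large' model: counts upward modulo 10."""
    nxt = (prefix[-1] + 1) % 10
    return {nxt: 0.9, (nxt + 1) % 10: 0.1}


def toy_draft(prefix: Prefix) -> Dict[int, float]:
    """Toy 'small' model: like the target, but its top choice is wrong
    whenever the last token is 3 or 7 (the right token is its second choice)."""
    nxt = (prefix[-1] + 1) % 10
    if prefix[-1] % 4 == 3:
        return {(nxt + 1) % 10: 0.6, nxt: 0.4}
    return {nxt: 0.8, (nxt + 1) % 10: 0.2}


if __name__ == "__main__":
    print(generate(toy_draft, toy_target, [0], 12))
```

Because every committed token is the larger model's own greedy choice, the sketch produces the same sequence as decoding with the larger model alone. The tree only changes how many tokens each verification round can accept: when the draft's top choice is wrong but a sibling branch is right, that branch can still be accepted.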