The rise of Large Language Models (LLMs) has brought ever-larger model sizes, which makes inference on low-resource devices challenging. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from poor efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Building on this framework, HeteGen combines heterogeneous parallel computing with asynchronous overlap to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
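As a rough illustration of the heterogeneous parallel computing idea, the sketch below splits one linear layer's weight matrix by output columns and computes the two shards concurrently before concatenating the results. It is not HeteGen's implementation: the two threads merely stand in for a CPU and a GPU, the names `matmul`, `split_linear`, and `alpha` (the fraction of columns given to the first worker) are hypothetical, and the asynchronous overlap of weight transfer with computation is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(x, w):
    """Plain-Python matrix product of x (m x k) and w (k x n)."""
    if not w or not w[0]:
        # A shard with zero columns contributes empty rows.
        return [[] for _ in x]
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_linear(x, w, alpha):
    """Compute x @ w by sharding w's output columns across two workers.

    alpha is a hypothetical split ratio: the first worker (standing in for
    one device) gets round(alpha * n) columns, the second gets the rest.
    """
    n_out = len(w[0])
    k = int(round(alpha * n_out))
    w_a = [row[:k] for row in w]
    w_b = [row[k:] for row in w]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both shards are submitted before either result is awaited,
        # so the two partial products can proceed concurrently.
        fut_a = pool.submit(matmul, x, w_a)
        fut_b = pool.submit(matmul, x, w_b)
        y_a, y_b = fut_a.result(), fut_b.result()
    # Column-wise concatenation restores the full output.
    return [ra + rb for ra, rb in zip(y_a, y_b)]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 3, 1]]
y = split_linear(x, w, alpha=0.5)
```

Splitting by output columns keeps the two partial products independent, so no reduction step is needed and the result matches the unsplit product for any value of `alpha` between 0 and 1. In a real system the ratio would presumably be chosen from measured device and transfer speeds, which this sketch does not attempt.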