The increasing size of large language models (LLMs) challenges their deployment on resource-constrained platforms. For example, the memory of modern GPUs is insufficient to hold LLMs that are hundreds of gigabytes in size. Offloading is a popular method to escape this constraint: the weights of an LLM are stored in host CPU memory or on SSD, and each weight is loaded to the GPU just before it is used. In our case study of offloaded inference, we found that, due to the low bandwidth between storage devices and the GPU, the latency of transferring large model weights from their offloaded location to GPU memory becomes the critical bottleneck, with actual compute taking nearly 0% of the runtime. To effectively reduce weight transfer latency, we propose Endor, a novel sparse format that compresses pruned LLM weights with unstructured sparsity patterns down to their non-zero values, achieving a high compression ratio with low decompression overhead. Endor accomplishes this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves a 2.25x speedup on OPT-66B and a 2.37x speedup on Llama2-70B.
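The bitmap idea described above can be sketched as follows: store a one-bit-per-element bitmap marking non-zero positions, plus the packed non-zero values. This is a minimal NumPy illustration of the general technique, not Endor's actual implementation; the function names and the choice of fp16 tiles are assumptions for the example.

```python
import numpy as np

def compress(weights: np.ndarray):
    """Compress a pruned weight tile to (bitmap, packed non-zero values).

    The bitmap costs 1 bit per element; values are stored densely in
    row-major order of their original positions.
    """
    mask = weights != 0
    bitmap = np.packbits(mask)       # 1 bit per element
    values = weights[mask]           # only the non-zero values
    return bitmap, values, weights.shape

def decompress(bitmap, values, shape):
    """Reconstruct the dense tile from the bitmap and packed values."""
    n = int(np.prod(shape))
    mask = np.unpackbits(bitmap, count=n).astype(bool)
    dense = np.zeros(n, dtype=values.dtype)
    dense[mask] = values
    return dense.reshape(shape)

# Example: an fp16 weight tile with ~50% unstructured (random) pruning.
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float16)
w[rng.random(w.shape) < 0.5] = 0

bitmap, vals, shape = compress(w)
assert np.array_equal(decompress(bitmap, vals, shape), w)  # lossless roundtrip

# Transfer cost: 1 bit/element for the bitmap + 2 bytes per non-zero,
# versus 2 bytes for every element of the dense tile.
print(f"compression ratio: {w.nbytes / (bitmap.nbytes + vals.nbytes):.2f}x")
```

At roughly 50% sparsity in fp16, the bitmap adds only 1/16 of the dense size, so the format stays well below the dense footprint, which is what makes the weight transfer over the slow SSD-to-GPU link cheaper.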