The rapid growth in the memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. Meanwhile, consumer-level GPUs, which constitute a larger market share, are typically overlooked for LLMs due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this paper, we envision a decentralized system that unlocks the potential of the vast pool of untapped consumer-level GPUs for pre-training, inference, and fine-tuning of LLMs with privacy protection. However, this system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity. To address these challenges, our system design incorporates: 1) a broker with a backup pool to let computing providers join and quit dynamically; 2) task scheduling based on hardware performance to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; 4) abstracting intermediate representation and execution planes to ensure compatibility across various devices and deep learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX 3080 GPUs can achieve throughput comparable to that of 4 H100 GPUs, which are significantly more expensive.
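To make the first two design points concrete, the following is a minimal sketch of a broker that keeps a backup pool and assigns tasks according to hardware performance. It is an illustration under stated assumptions, not the system described in the paper: the names `Provider`, `Broker`, `max_active`, and `tflops`, the rule of promoting the fastest backup provider when an active one quits, and the greedy earliest-finish-time scheduling rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    name: str
    tflops: float  # hypothetical benchmark score for this device


@dataclass
class Broker:
    max_active: int  # hypothetical cap on concurrently active providers
    active: dict = field(default_factory=dict)
    backup: list = field(default_factory=list)

    def join(self, provider: Provider) -> None:
        """Activate the provider if there is room, else park it in the backup pool."""
        if len(self.active) < self.max_active:
            self.active[provider.name] = provider
        else:
            self.backup.append(provider)

    def quit(self, name: str) -> None:
        """Remove a provider; if it was active, promote the fastest backup (assumed policy)."""
        if name in self.active:
            del self.active[name]
            if self.backup:
                best = max(self.backup, key=lambda p: p.tflops)
                self.backup.remove(best)
                self.active[best.name] = best
        else:
            self.backup = [p for p in self.backup if p.name != name]

    def schedule(self, tasks: dict) -> dict:
        """Greedy assignment: largest task first, to the provider that would finish it earliest.

        `tasks` maps a task id to an abstract cost; estimated time is cost / tflops.
        """
        load = {name: 0.0 for name in self.active}
        plan = {name: [] for name in self.active}
        for task_id, cost in sorted(tasks.items(), key=lambda kv: -kv[1]):
            target = min(
                self.active,
                key=lambda n: (load[n] + cost) / self.active[n].tflops,
            )
            load[target] += cost
            plan[target].append(task_id)
        return plan
```

In this sketch, a provider that joins once the active set is full waits in the backup pool and is promoted only when an active provider quits, so the set of workers visible to the scheduler stays at a stable size while peers come and go.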