Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift away from cloud reliance and towards edge-based fine-tuning of personal LLMs. However, edge devices are resource-scarce while fine-tuning is computationally intensive, which hinders training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. To tackle these challenges, we propose Pluto and Charon (PAC), a time- and memory-efficient collaborative edge AI framework for personal LLM fine-tuning. PAC breaks the resource wall of personal LLM fine-tuning through a sophisticated algorithm-system co-design. (1) Algorithmically, PAC implements a personal LLM fine-tuning technique that is efficient in parameters, time, and memory. It employs Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone, and an activation cache mechanism further streamlines training by removing the need to repeat forward passes across multiple epochs. (2) Systemically, PAC pools edge devices in close proximity as a collective resource for in-situ personal LLM fine-tuning, orchestrating distributed training with hybrid data and pipeline parallelism. Because the activation cache eliminates forward passes through the LLM backbone, the Parallel Adapters alone can be fine-tuned using data parallelism. Extensive evaluation on a prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches, achieving up to 8.64x end-to-end speedup and up to 88.16% reduction in memory footprint.
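The activation-cache idea described above can be sketched as follows. This is a minimal illustrative toy, not PAC's actual implementation: the frozen backbone, the adapter, and all names here are assumptions for illustration. The frozen LLM backbone is run forward exactly once per sample (epoch 0); every later epoch trains only the lightweight adapter on the cached activations, so no backbone forward or backward pass is repeated.

```python
import numpy as np

# Hypothetical sketch of the activation-cache mechanism (illustrative names,
# not PAC's API): backbone activations are computed once and reused, so only
# the small trainable adapter is touched in subsequent epochs.

rng = np.random.default_rng(0)
backbone_calls = 0  # count how often the expensive backbone forward runs

d_in, d_hid = 8, 16
W_backbone = rng.normal(size=(d_in, d_hid))  # frozen backbone weights
W_adapter = np.zeros((d_hid, 1))             # trainable adapter weights

def frozen_backbone(x):
    """Stand-in for the expensive frozen LLM backbone forward pass."""
    global backbone_calls
    backbone_calls += 1
    return np.tanh(x @ W_backbone)  # frozen: no gradients needed here

X = rng.normal(size=(32, d_in))
y = rng.normal(size=(32, 1))

cache = {}
for epoch in range(5):
    for i in range(len(X)):
        if i not in cache:                 # epoch 0: forward once, then cache
            cache[i] = frozen_backbone(X[i:i + 1])
        h = cache[i]                       # later epochs: cache hit, no forward
        # adapter-only training step (squared-error gradient on the adapter)
        pred = h @ W_adapter
        grad = h.T @ (pred - y[i:i + 1])
        W_adapter -= 0.05 * grad

print(backbone_calls)  # 32: one forward per sample, not 32 * 5
```

Even in this toy, the backbone runs 32 times rather than 160, mirroring how the cache confines multi-epoch training cost to the adapter.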