Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks. Nonetheless, it is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets. In this paper, we propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices such as smartphones. Confidant partitions an LLM into several sub-models so that each fits into a mobile device's memory. A pipeline-parallel training mechanism is further developed to ensure fast and efficient distributed training. In addition, we propose a novel backend scheduler that allocates different attention heads to heterogeneous compute hardware, including mobile CPUs and GPUs, to maximize compute resource utilization on each edge device. Our preliminary experimental results show that Confidant achieves up to 45.3% memory reduction and up to 8.03x inference speedup in practical settings.
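The two mechanisms named in the abstract, memory-aware model partitioning and per-backend attention-head allocation, can be illustrated with a minimal sketch. The abstract does not specify Confidant's actual algorithms, so everything below is an assumption made for illustration: the function names `partition_layers` and `allocate_heads`, the greedy contiguous partitioning rule, the throughput-proportional head split, and all numeric values are hypothetical.

```python
from typing import Dict, List


def partition_layers(layer_mem_mb: List[float],
                     device_mem_mb: List[float]) -> List[List[int]]:
    """Greedily assign contiguous layers to devices (hypothetical heuristic).

    Layers are placed in order, so each device holds one pipeline stage.
    A device is filled until the next layer would exceed its memory budget.
    """
    stages: List[List[int]] = [[] for _ in device_mem_mb]
    dev = 0
    used = 0.0
    for idx, mem in enumerate(layer_mem_mb):
        # Advance to the next device while the current one cannot hold this layer.
        while dev < len(device_mem_mb) and used + mem > device_mem_mb[dev]:
            dev += 1
            used = 0.0
        if dev == len(device_mem_mb):
            raise ValueError(f"layer {idx} does not fit on the remaining devices")
        stages[dev].append(idx)
        used += mem
    return stages


def allocate_heads(num_heads: int,
                   throughput: Dict[str, float]) -> Dict[str, int]:
    """Split attention heads across backends in proportion to throughput.

    This is an assumed stand-in for a backend scheduler, not the paper's
    method. Fractional shares are rounded with the largest-remainder rule
    so that the allocations always sum to num_heads.
    """
    total = sum(throughput.values())
    shares = {b: num_heads * t / total for b, t in throughput.items()}
    alloc = {b: int(s) for b, s in shares.items()}
    leftover = num_heads - sum(alloc.values())
    # Give the remaining heads to the backends with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda b: shares[b] - alloc[b], reverse=True)
    for b in by_remainder[:leftover]:
        alloc[b] += 1
    return alloc


if __name__ == "__main__":
    # Six equally sized layers spread over three devices with equal budgets.
    print(partition_layers([100.0] * 6, [250.0, 250.0, 250.0]))
    # 32 heads split between a CPU and a GPU assumed to be twice as fast.
    print(allocate_heads(32, {"cpu": 1.0, "gpu": 2.0}))
```

Keeping the layers contiguous means each device corresponds to exactly one pipeline stage, which is the usual arrangement for pipeline-parallel training. A real scheduler would also have to account for activation memory, communication cost, and measured rather than assumed backend throughput.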