Large Language Models (LLMs) and foundation models have attracted broad interest because they offer new opportunities for individuals and businesses to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires vast amounts of data, which can be difficult to access due to legal or technical restrictions, and may require private computing resources. Federated Learning (FL) is a solution designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, using FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization to that of a state-of-the-art data-center GPU, and study the network utilization under realistic conditions. Our contribution is twofold: First, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads. Second, by comparing these systems with a data-center GPU, we demonstrate the potential for improvement and the next steps toward achieving greater computational efficiency at the edge.
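For readers unfamiliar with the model FLOP utilization (MFU) metric mentioned above, the following is a minimal illustrative sketch of how it is commonly computed (achieved training FLOPs per second divided by the device's theoretical peak). The function name, the ~6N FLOPs-per-token estimate for a forward and backward pass, and all numeric values are assumptions for illustration, not measurements from this paper.

```python
# Minimal sketch of model FLOP utilization (MFU).
# All numbers below are placeholders, not results from the paper.

def mfu(tokens_per_second: float, flops_per_token: float, peak_flops: float) -> float:
    """MFU = achieved FLOPs per second / theoretical peak FLOPs of the device."""
    achieved = tokens_per_second * flops_per_token
    return achieved / peak_flops

# Hypothetical edge device fine-tuning FLAN-T5-Small (~80M parameters),
# using the common ~6 * N FLOPs-per-token estimate for forward + backward passes.
params = 80e6
flops_per_token = 6 * params
throughput = 1_000.0   # tokens/s, placeholder
peak = 1.0e12          # 1 TFLOP/s peak, placeholder

print(f"MFU: {mfu(throughput, flops_per_token, peak):.2%}")
```

A low MFU on an edge device relative to a data-center GPU indicates headroom for software or hardware optimization, which is the comparison the abstract refers to.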