Large language models (LLMs) and foundation models are popular because they offer individuals and businesses new opportunities to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires vast amounts of data, which can be difficult to access because of legal or technical restrictions and may require private computing resources. Federated Learning (FL) is designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, using FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization to that of a state-of-the-art data center GPU, and study the network utilization under realistic conditions. Our contribution is twofold. First, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads. Second, by comparing these systems with a data center GPU, we demonstrate the potential for improvement and the next steps toward achieving greater computational efficiency at the edge.