Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing its proprietary weights. A heterogeneity-aware LoRA adaptation strategy supports efficient edge deployment across diverse hardware, and a logit-level fusion mechanism coordinates the edge and cloud models in real time. Extensive experiments demonstrate that Floe enhances user privacy and personalization, and that it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.
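The abstract does not specify the exact form of the logit-level fusion rule; the sketch below is only one plausible instantiation, assuming the edge SLM and cloud LLM share a tokenizer/vocabulary and that fusion is a weighted interpolation of their next-token log-probabilities. The function name `fuse_logits` and the mixing weight `alpha` are hypothetical, not taken from the paper.

```python
import torch


def fuse_logits(slm_logits: torch.Tensor,
                llm_logits: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Blend next-token logits from an on-device SLM and a cloud LLM.

    Both tensors have shape (batch, vocab_size) and are assumed to use
    the same vocabulary. `alpha` weights the local (personalized) model.
    """
    # Normalize each model's logits to log-probabilities so the mixture
    # is insensitive to differences in logit scale between the two models.
    slm_logp = torch.log_softmax(slm_logits, dim=-1)
    llm_logp = torch.log_softmax(llm_logits, dim=-1)
    return alpha * slm_logp + (1.0 - alpha) * llm_logp


# Example usage: greedy decoding of the next token from the fused scores.
slm_out = torch.randn(1, 32000)   # placeholder edge SLM logits
llm_out = torch.randn(1, 32000)   # placeholder cloud LLM logits
next_token = fuse_logits(slm_out, llm_out, alpha=0.7).argmax(dim=-1)
```

In a deployment like the one the abstract describes, only token-level logits (not raw user text or model weights) would need to cross the edge-cloud boundary at each decoding step, which is consistent with the privacy and black-box constraints stated above.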