Deploying Large Language Models (LLMs) locally on mobile devices is challenging because of their extensive memory requirements. In this paper, we introduce LinguaLinked, a system for decentralized, distributed LLM inference on mobile devices. LinguaLinked enables collaborative execution of the inference task across multiple trusted devices and ensures data privacy by processing information locally. It uses three key strategies. First, an optimized model assignment technique segments LLMs and uses linear optimization to align segments with each device's capabilities. Second, an optimized data transmission mechanism ensures efficient and structured data flow between model segments while maintaining the integrity of the original model structure. Third, a runtime load balancer actively monitors and redistributes tasks among mobile devices to prevent bottlenecks, enhancing the system's overall efficiency and responsiveness. Through extensive testing across various mobile devices, from high-end to low-end Android devices, we demonstrate that LinguaLinked facilitates efficient LLM inference while maintaining consistent throughput and minimal latency. In our evaluations, compared to the baseline, LinguaLinked achieves an inference performance acceleration of $1.11\times$ to $1.61\times$ in single-threaded settings and $1.73\times$ to $2.65\times$ with multi-threading. Additionally, runtime load balancing yields an overall inference acceleration of $1.29\times$ to $1.32\times$.
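To make the model assignment idea concrete, the following is a minimal sketch of matching contiguous model segments to devices under memory limits. It is not the LinguaLinked formulation: the abstract states that linear optimization is used but gives no details, so this sketch substitutes a plain exhaustive search over split points, which is only practical for small toy inputs. The function name `assign_segments`, the per-layer memory and cost figures, and the device memory and speed figures are all hypothetical.

```python
from itertools import combinations


def assign_segments(layer_mem, layer_cost, dev_mem, dev_speed):
    """Split layers into contiguous segments, one per device, in device order.

    Hypothetical toy model: a segment's time is its summed layer cost divided
    by the device speed. The objective is the slowest device's time, since
    that device limits pipeline throughput.
    Returns (bottleneck_time, boundaries), or None if no split fits in memory.
    """
    n, k = len(layer_mem), len(dev_mem)
    best = None
    # Choose k-1 cut points among the n-1 gaps between consecutive layers.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        times = []
        feasible = True
        for d in range(k):
            lo, hi = bounds[d], bounds[d + 1]
            # Reject splits whose segment does not fit in the device's memory.
            if sum(layer_mem[lo:hi]) > dev_mem[d]:
                feasible = False
                break
            times.append(sum(layer_cost[lo:hi]) / dev_speed[d])
        if feasible:
            candidate = (max(times), bounds)
            if best is None or candidate < best:
                best = candidate
    return best


# Illustrative numbers only: six equal layers, three devices of decreasing capability.
layer_mem = [2, 2, 2, 2, 2, 2]       # memory units per layer
layer_cost = [1, 1, 1, 1, 1, 1]      # compute units per layer
dev_mem = [8, 6, 4]                  # memory budget per device
dev_speed = [3.0, 2.0, 1.0]          # compute units per unit time

result = assign_segments(layer_mem, layer_cost, dev_mem, dev_speed)
print(result)
```

With these illustrative inputs the search gives three layers to the fastest device, two to the middle one, and one to the slowest, so every device finishes its segment in the same time. A production system would replace the exhaustive search with a solver and would also need to account for transmission cost between segments, which this sketch ignores.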