Although the computing power of mobile devices is increasing, machine learning models are growing in size as well. This trend strains mobile devices, which are limited by memory capacity and battery life. While many services, such as ChatGPT and Midjourney, run all inference in the cloud, we believe a flexible, fine-grained distribution of tasks is more desirable. In this work, we consider model segmentation as a way to improve the user experience: the computation is divided between the mobile device and the cloud so that the compute-heavy portion of the model is offloaded while the required data transfer is minimized. We show that this division not only reduces the wait time for users but can also be tuned to optimize the cloud's workload. To achieve this, we design a scheduler that collects information about network quality, client device capability, and job requirements, and uses it to make decisions that deliver consistent performance across a range of devices while reducing the work the cloud needs to perform.
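To make the idea of a split-point decision concrete, the following is a minimal sketch, not the scheduler described in this work. It assumes a purely sequential model, a simple additive latency model (device compute, then one upload of the intermediate activation, then cloud compute), and that the result sent back to the device is negligibly small. The names `Layer`, `choose_split`, and `cloud_weight`, as well as all numbers in the usage note, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    flops: float      # assumed compute cost of the layer
    out_bytes: float  # assumed size of the activation the layer produces


def choose_split(
    layers: List[Layer],
    input_bytes: float,
    device_flops_per_s: float,
    cloud_flops_per_s: float,
    bandwidth_bytes_per_s: float,
    rtt_s: float,
    cloud_weight: float = 0.0,
) -> Tuple[int, float]:
    """Return (k, latency): layers[:k] run on the device, layers[k:] in the cloud.

    k == 0 means everything runs in the cloud; k == len(layers) means
    everything runs on the device. cloud_weight > 0 adds a penalty on
    cloud compute time, so the choice trades user latency against cloud load.
    """
    total_flops = sum(layer.flops for layer in layers)
    device_flops = 0.0
    best_cost = float("inf")
    best_k, best_latency = 0, float("inf")

    for k in range(len(layers) + 1):
        if k > 0:
            device_flops += layers[k - 1].flops
        cloud_flops = total_flops - device_flops

        if k == len(layers):
            transfer_s = 0.0  # nothing is sent: the whole model runs locally
        else:
            # Upload either the raw input (k == 0) or the activation at the cut.
            payload = input_bytes if k == 0 else layers[k - 1].out_bytes
            transfer_s = rtt_s + payload / bandwidth_bytes_per_s

        cloud_s = cloud_flops / cloud_flops_per_s
        latency = device_flops / device_flops_per_s + transfer_s + cloud_s
        cost = latency + cloud_weight * cloud_s

        if cost < best_cost:
            best_cost, best_k, best_latency = cost, k, latency

    return best_k, best_latency
```

As a usage illustration with made-up numbers: for a three-layer model whose first layer is cheap and shrinks a 10 MB input to a 100 KB activation, a capable device on a fast link would cut after the first layer, while the same device on a very poor link would run everything locally, and a weak device on a fast link would send the raw input to the cloud. Raising `cloud_weight` shifts the cut toward the device, which is one simple way to express the goal of reducing cloud work.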