Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices have not kept pace with the ever-growing number of parameters in these models. As models become larger and more complicated, their novel yet sophisticated structures challenge inference runtime optimization. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that, compared to state-of-the-art approaches, FluidML consistently reduces end-to-end inference latency by up to 25.38% for popular language models and reduces peak memory usage by up to 41.47%. FluidML comprises ~30K lines of code, is built for general-purpose use, and will be released to the community as an open-source inference runtime optimization framework.