Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads

Running multiple deep neural networks (DNNs) in parallel is an emerging workload both on edge devices, such as mobile phones, where multiple tasks serve a single user's daily activities, and in data centers, where requests arrive from millions of users, as seen with large language models. To reduce the high computational and memory cost of these workloads, various sparsification approaches have been introduced, and sparsity is now widespread across different types of DNN models. This creates a need for scheduling sparse multi-DNN workloads, a problem that is largely unexplored in the literature. This paper systematically analyses the use cases of multiple sparse DNNs and investigates the opportunities for optimization. Based on these findings, we propose Dysta, a novel bi-level dynamic and static scheduler that uses both static sparsity patterns and dynamic sparsity information to schedule sparse multi-DNN workloads. The static and dynamic components of Dysta are designed jointly, at the software and hardware levels respectively, to improve and refine the scheduling approach. To facilitate future work on this class of workloads, we construct a public benchmark of sparse multi-DNN workloads across different deployment scenarios, spanning mobile phones, AR/VR wearables, and data centers. A comprehensive evaluation on this benchmark demonstrates that our approach outperforms state-of-the-art methods, with up to a 10% decrease in latency constraint violation rate and a nearly 4× reduction in average normalized turnaround time. Our artifacts and code are publicly available at: https://github.com/SamsungLabs/Sparse-Multi-DNN-Scheduling.
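The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions: normalized turnaround time is a request's turnaround time under multi-DNN scheduling divided by its execution time when run alone, and the violation rate is the fraction of requests that finish after their latency target. The `Request` class, its field names, and the sample values are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All fields are hypothetical names chosen for illustration.
    arrival: float   # time the request entered the queue
    finish: float    # time it completed under multi-DNN scheduling
    isolated: float  # execution time when run alone on the accelerator
    deadline: float  # absolute latency target, same time unit


def antt(requests):
    """Average normalized turnaround time: mean of turnaround / isolated time."""
    return sum((r.finish - r.arrival) / r.isolated for r in requests) / len(requests)


def violation_rate(requests):
    """Fraction of requests that finished after their deadline."""
    return sum(r.finish > r.deadline for r in requests) / len(requests)


# Made-up sample trace; not data from the paper.
requests = [
    Request(arrival=0.0, finish=10.0, isolated=10.0, deadline=20.0),
    Request(arrival=0.0, finish=30.0, isolated=10.0, deadline=20.0),
    Request(arrival=5.0, finish=25.0, isolated=10.0, deadline=30.0),
    Request(arrival=10.0, finish=50.0, isolated=20.0, deadline=40.0),
]

print(antt(requests))            # 2.0
print(violation_rate(requests))  # 0.5
```

Lower is better for both metrics. An ANTT of 1.0 would mean no request was slowed down by sharing the hardware.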
