Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance both from concurrent execution and from matching each model to its best-suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which constrains this matching and results in high Service Level Objective (SLO) violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present SparseLoom, a demonstrator system showing that model stitching can be deployed on SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
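To make the core idea concrete, the sketch below illustrates one way stage-aligned subgraphs from several sparse variants of the same base model could be recombined into a new variant at deployment time, with no re-training. All names here (`SparseVariant`, `stitch`, `greedy_choice`) and the per-stage latency heuristic are hypothetical illustrations, not SparseLoom's actual API or scheduling policy.

```python
# Minimal sketch of model stitching, assuming all sparse variants are
# derived from one base model and split at identical stage boundaries,
# so stage i of every variant has the same input/output tensor shapes.
from dataclasses import dataclass
from typing import List

import torch.nn as nn


@dataclass
class SparseVariant:
    """A sparsified copy of the base model, split into aligned stages."""
    name: str
    stages: List[nn.Module]          # stage i shares I/O shapes across variants
    latency_per_stage: List[float]   # profiled latency per stage (ms)


def stitch(variants: List[SparseVariant], choice: List[int]) -> nn.Sequential:
    """Build a new variant by taking stage i from variants[choice[i]].

    No re-training is needed because shared stage boundaries make every
    per-stage combination shape-compatible.
    """
    stages = [variants[v].stages[i] for i, v in enumerate(choice)]
    return nn.Sequential(*stages)


def greedy_choice(variants: List[SparseVariant], budget_ms: float) -> List[int]:
    """Per stage, pick the slowest (assumed densest, most accurate) subgraph
    that still fits an even share of the latency budget; otherwise fall back
    to the fastest one. A toy heuristic, not SparseLoom's scheduler."""
    n_stages = len(variants[0].stages)
    per_stage_budget = budget_ms / n_stages
    choice = []
    for i in range(n_stages):
        by_latency = sorted(range(len(variants)),
                            key=lambda v: variants[v].latency_per_stage[i])
        fitting = [v for v in by_latency
                   if variants[v].latency_per_stage[i] <= per_stage_budget]
        choice.append(fitting[-1] if fitting else by_latency[0])
    return choice
```

The key enabler, as stated above, is that the variants are recombined rather than re-trained: because every variant shares the base model's stage boundaries, any per-stage combination yields a valid model without fine-tuning.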