Cascade systems comprise a two-model sequence: a lightweight model processes all samples, and a heavier, higher-accuracy model conditionally refines the harder samples to improve accuracy. Because the light model can be placed on the device and the heavy model on a server, model cascades constitute a widely used distributed inference approach. With the rapid expansion of intelligent indoor environments, such as smart homes, a new setting is emerging, the Multi-Device Cascade, in which multiple diverse devices simultaneously use a shared heavy model hosted on the same server, typically located within or close to the consumer environment. This work presents MultiTASC, a multi-tenancy-aware scheduler that adaptively controls the devices' forwarding decision functions in order to maximize system throughput while sustaining high accuracy and low latency. By explicitly considering device heterogeneity, our scheduler improves the latency service-level objective (SLO) satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade methods in highly heterogeneous setups while serving over 40 devices, showcasing its scalability.
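To make the cascade mechanism concrete, the following is a minimal sketch of a two-model cascade with a per-device forwarding decision. It rests on assumptions not stated in the abstract: the forwarding decision function is modeled as a threshold on the light model's top-1 confidence, and `adjust_threshold` is a hypothetical, deliberately naive update rule that only illustrates the idea of a scheduler tuning forwarding behavior in response to SLO satisfaction. It is not MultiTASC's actual algorithm, and all names and default values are illustrative.

```python
from typing import Callable, Sequence, Tuple

Probs = Sequence[float]
Model = Callable[[object], Probs]


def argmax(probs: Probs) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def should_forward(light_probs: Probs, threshold: float) -> bool:
    """Assumed decision function: forward when top-1 confidence is below threshold."""
    return max(light_probs) < threshold


def cascade_predict(
    sample: object, light_model: Model, heavy_model: Model, threshold: float
) -> Tuple[int, bool]:
    """Return (predicted class, whether the sample was forwarded to the server)."""
    light_probs = light_model(sample)  # runs on the device for every sample
    if should_forward(light_probs, threshold):
        heavy_probs = heavy_model(sample)  # shared heavy model on the server
        return argmax(heavy_probs), True
    return argmax(light_probs), False


def adjust_threshold(
    threshold: float, slo_satisfaction: float, target: float = 0.95, step: float = 0.05
) -> float:
    """Hypothetical rule: forward less when the latency SLO is missed, more otherwise."""
    if slo_satisfaction < target:
        return max(0.0, threshold - step)  # lower threshold -> fewer forwards
    return min(1.0, threshold + step)  # higher threshold -> more forwards
```

Under this sketch, lowering a device's threshold reduces the load it places on the shared server at the cost of relying more on the light model, which is the throughput, accuracy, and latency trade-off the abstract describes.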