In this paper, we present Mondrian, an edge system that enables high-performance object detection on high-resolution video streams. Many lightweight models and system optimization techniques have been proposed for resource-constrained devices, but they do not fully utilize the potential of accelerators when processing dynamic, high-resolution videos. To enable such capability, we devise a novel technique, Compressive Packed Inference, which minimizes per-pixel processing costs by selectively determining which pixels need to be processed and combining them to maximize processing parallelism. In particular, our system quickly extracts regions of interest (ROIs) and dynamically shrinks them to reflect the fast-changing characteristics of objects and scenes. It then combines the scaled ROIs into large canvases to maximize the utilization of inference accelerators such as GPUs. Evaluation across various datasets, models, and devices shows that Mondrian outperforms state-of-the-art baselines (e.g., input rescaling, ROI extraction, ROI extraction + batching) by 15.0-19.7% in accuracy and achieves $6.65\times$ higher throughput than frame-wise inference when processing various 1080p video streams. We will release the code after the paper review.
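To make the packing step concrete, the following is a minimal sketch of how scaled ROIs could be laid out on fixed-size canvases. It is not Mondrian's actual algorithm: it substitutes a simple shelf-packing heuristic, assumes the per-ROI scale factors have already been chosen by some other component, and all names (`ROI`, `Placement`, `shrink`, `pack_rois`) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ROI:
    """A region of interest in a source frame (hypothetical structure)."""
    frame_id: int
    x: int
    y: int
    w: int
    h: int


@dataclass(frozen=True)
class Placement:
    """Where a scaled ROI lands: canvas index, top-left corner, scaled size."""
    roi: ROI
    canvas_id: int
    x: int
    y: int
    w: int
    h: int


def shrink(roi, scale):
    """Return the scaled (width, height) of an ROI, at least 1 pixel each."""
    return max(1, round(roi.w * scale)), max(1, round(roi.h * scale))


def pack_rois(rois, scales, canvas_size):
    """Shelf-pack scaled ROIs into square canvases of side canvas_size.

    Assumption: this simple heuristic stands in for the paper's packing
    method, which the abstract does not specify.
    """
    sized = [(roi, *shrink(roi, s)) for roi, s in zip(rois, scales)]
    for _, w, h in sized:
        if w > canvas_size or h > canvas_size:
            raise ValueError("scaled ROI does not fit on a canvas")

    # Placing the tallest ROIs first keeps each shelf's height tight.
    sized.sort(key=lambda item: item[2], reverse=True)

    placements = []
    canvas_id, shelf_y, shelf_h, cursor_x = 0, 0, 0, 0
    for roi, w, h in sized:
        if cursor_x + w > canvas_size:
            # Current shelf is full: open a new shelf below it.
            shelf_y += shelf_h
            cursor_x, shelf_h = 0, 0
        if shelf_y + h > canvas_size:
            # Current canvas is full: open a new canvas.
            canvas_id += 1
            shelf_y, cursor_x, shelf_h = 0, 0, 0
        placements.append(Placement(roi, canvas_id, cursor_x, shelf_y, w, h))
        cursor_x += w
        shelf_h = max(shelf_h, h)
    return placements


if __name__ == "__main__":
    rois = [ROI(0, 0, 0, 400, 200), ROI(0, 500, 100, 200, 200), ROI(1, 10, 10, 300, 100)]
    for p in pack_rois(rois, [0.5, 1.0, 1.0], canvas_size=512):
        print(p.canvas_id, p.x, p.y, p.w, p.h)
```

Each resulting canvas would then be passed to the detector as a single input, and the recorded placements allow detections to be mapped back to the source frame and rescaled to original coordinates. A real system would also need to choose the scale factors and handle padding between ROIs, which this sketch leaves out.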