Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges because these systems have limited computational resources and battery power. To tackle these challenges, continuous learning runs a lightweight "student" model at deployment (inference), uses a larger "teacher" model to label sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations of state-of-the-art continuous learning systems: (1) they focus on computations for retraining while overlooking the compute needs of inference and labeling, (2) they rely on power-hungry GPUs, which are unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server intended for multi-tenant scenarios, which is again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to execute inference, labeling, and training concurrently in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator that enables parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than the state-of-the-art GPU-based continuous learning systems Ekya and EOMU, respectively, while consuming 254x less power.