A growing research challenge is running costly machine learning (ML) networks locally on resource-constrained edge devices. ML networks with large convolutional layers can easily exceed available memory, increasing latency due to excessive OS swapping. Previous memory reduction techniques such as pruning and quantization reduce model accuracy and often require retraining. Alternatively, distributed methods partition the convolutions into equivalent smaller sub-computations, but their implementations introduce communication costs and require a network of devices. Distributed partitioning approaches can, however, also be used to run in a reduced memory footprint on a single device by subdividing the network into smaller operations. In this paper, we extend prior work on distributed partitioning to memory-aware execution on a single device. Our approach extends prior fusing strategies to allow for multiple groups of convolutional layers that are fused and tiled independently. This enables trading off overhead against data reuse in order to specifically reduce the memory footprint. We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations for an arbitrary set of convolutional layers. When applied to the YOLOv2 object detection network, results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints. Additionally, our algorithm returns a configuration whose latency is within 6% of the best latency measured in a manual search.
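To make the idea of predicting memory for a fused and tiled group of convolutional layers concrete, the following is a minimal sketch and not the predictor or search algorithm proposed in the paper. It assumes square feature maps, counts only activations (weights and framework overhead are ignored), and uses an upper bound on the input tile that disregards clipping at image borders. The names `Conv`, `tile_sizes`, `peak_bytes`, and `smallest_tiling` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Conv:
    """Hypothetical description of one convolutional layer."""
    in_ch: int
    out_ch: int
    kernel: int
    stride: int = 1


def tile_sizes(layers, out_tile):
    """Side length of the tile at every layer boundary of a fused group.

    Walks backwards from the final output tile: producing t outputs along
    one axis needs (t - 1) * stride + kernel inputs (an upper bound that
    ignores clipping at the image border).
    """
    sizes = [out_tile]
    for layer in reversed(layers):
        sizes.append((sizes[-1] - 1) * layer.stride + layer.kernel)
    return sizes[::-1]  # sizes[i] is the input side of layer i


def peak_bytes(layers, out_side, n_tiles, bytes_per_value=4):
    """Estimated peak activation memory when the group is tiled n_tiles x n_tiles.

    Assumption: only one layer's input and output tile are live at a time.
    """
    sizes = tile_sizes(layers, ceil(out_side / n_tiles))
    peak = 0
    for i, layer in enumerate(layers):
        live = sizes[i] ** 2 * layer.in_ch + sizes[i + 1] ** 2 * layer.out_ch
        peak = max(peak, live)
    return peak * bytes_per_value


def smallest_tiling(layers, out_side, budget_bytes, max_tiles=8):
    """Return the coarsest tiling that fits the budget, or None if none does.

    Coarser tilings are tried first because finer tiles overlap more and
    therefore recompute more data.
    """
    for n in range(1, max_tiles + 1):
        if peak_bytes(layers, out_side, n) <= budget_bytes:
            return n
    return None


if __name__ == "__main__":
    group = [Conv(3, 32, 3), Conv(32, 64, 3)]
    for n in (1, 2, 4):
        print(n, peak_bytes(group, out_side=64, n_tiles=n))
    print(smallest_tiling(group, out_side=64, budget_bytes=500_000))
```

In this toy model, finer tiling shrinks the live activations roughly quadratically while the overlapping halo of each tile grows in relative terms, which is the overhead versus data reuse trade-off described above. A search over multiple independently tiled groups would repeat this estimate for each candidate partition of the layer sequence.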