Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference

Memory optimization for deep neural network (DNN) inference has become increasingly relevant with the emergence of TinyML, the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory of such devices, because DNN inference requires large run-time buffers to store activations and other intermediate data. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs which, compared to existing tiling methods, reduces memory usage without inducing any run-time overhead. FDT applies to a larger variety of network layers than existing tiling methods, which focus on convolutions. It improves TinyML memory optimization significantly by reducing the memory usage of models for which this was not previously possible and by providing alternative design points for models that show high run-time overhead with existing methods. To identify the best tiling configuration, we propose an end-to-end flow with a new path discovery method that applies FDT and existing tiling methods in a fully automated way, including the scheduling of operations and the planning of the buffer layout in memory. Of seven evaluated models, FDT reduced memory usage by 76.2% and 18.1% for two models to which existing tiling methods could not be applied. Two other models showed a significant run-time overhead with existing methods; for these, FDT provided alternative design points with no overhead, at the cost of smaller memory savings.
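To make the idea of tiling an intermediate buffer concrete, the following toy sketch splits the hidden activations between two fully connected layers into slices along the channel (depth) dimension. It is an illustration under stated assumptions, not the FDT implementation: the abstract does not give the algorithm, so the choice of dense layers, the merge by accumulation, and all function names (`dense_relu`, `fused_untiled`, `fused_depthwise_tiled`) are hypothetical.

```python
# Illustrative sketch only: two fully connected layers, y = W2 * relu(W1 * x).
# All names here are hypothetical and not taken from the paper.

def dense_relu(weights, x):
    """One dense layer with ReLU: one output value per weight row."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def fused_untiled(w1, w2, x):
    """Reference version: materializes the full hidden activation buffer."""
    hidden = dense_relu(w1, x)  # buffer holding len(w1) values at once
    y = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return y, len(hidden)


def fused_depthwise_tiled(w1, w2, x, tile):
    """Process the hidden channels in slices of at most `tile` values.

    Each slice is produced, consumed, and discarded before the next one,
    so at most `tile` hidden values are live at any time. The partial
    results of the second layer are merged by accumulation (an assumption
    made for this sketch).
    """
    y = [0.0] * len(w2)
    peak = 0  # largest hidden slice that was live at once
    for start in range(0, len(w1), tile):
        part = dense_relu(w1[start:start + tile], x)  # hidden slice
        peak = max(peak, len(part))
        for j, row in enumerate(w2):
            y[j] += sum(w * h for w, h in zip(row[start:start + tile], part))
    return y, peak
```

In this toy case both versions perform the same number of multiply-accumulate operations, while the tiled version keeps only `tile` hidden values live instead of all of them. Whether this carries over to a real model depends on the layer types and on the scheduling and memory layout that the paper's automated flow determines.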
