Recent work on neural network pruning argues that reducing network depth is more effective at lowering run-time memory usage and inference latency than reducing network width through channel pruning. Accordingly, several recent works propose depth compression algorithms that merge convolution layers. However, existing algorithms have a restricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm that targets general convolution operations. We formulate a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations to reduce end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We measure the inference latency of our method and the baselines with TensorRT for a fair comparison. Our method outperforms the baseline method with higher accuracy and faster inference speed for MobileNetV2 on the ImageNet dataset. Specifically, we achieve a $1.41\times$ speed-up with a $0.11$ percentage point accuracy gain for MobileNetV2-1.0 on ImageNet.
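The merging step rests on the fact that two consecutive convolutions with no activation between them compose into a single equivalent convolution. The sketch below illustrates this for the simplest case only. It assumes a single channel, one dimension, stride 1, no padding, and no bias. The function names `correlate_valid` and `merge_kernels` are hypothetical and are not taken from the paper's implementation.

```python
def correlate_valid(signal, kernel):
    """Cross-correlation without padding, as a stride-1 convolution layer computes it."""
    k = len(kernel)
    return [
        sum(signal[n + i] * kernel[i] for i in range(k))
        for n in range(len(signal) - k + 1)
    ]


def merge_kernels(first, second):
    """Full convolution of two kernels, giving the kernel of the merged layer."""
    merged = [0.0] * (len(first) + len(second) - 1)
    for i, a in enumerate(first):
        for j, b in enumerate(second):
            merged[i + j] += a * b
    return merged


# Illustrative values only; any input and kernels would do.
signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0]
k1 = [1.0, 0.0, -1.0]
k2 = [0.5, 2.0, 1.0]

# Two layers applied in sequence, with the activation replaced by the identity.
two_layer = correlate_valid(correlate_valid(signal, k1), k2)

# One layer with the merged kernel.
merged_kernel = merge_kernels(k1, k2)
one_layer = correlate_valid(signal, merged_kernel)

# The two outputs agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(two_layer, one_layer))
```

In this sketch the merged kernel has length `len(k1) + len(k2) - 1`, so two size-3 kernels become one size-5 kernel. A merged layer is therefore shallower but not automatically cheaper, which is consistent with the abstract's framing of merging as a latency-driven selection problem rather than an unconditional transformation.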