Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation

Semantic segmentation is a pixel-level prediction task that classifies each pixel of the input image. Deep learning models, such as convolutional neural networks (CNNs), have been extremely successful in achieving excellent performance in this domain. However, mobile applications, such as autonomous driving, demand real-time processing of an incoming stream of images. Hence, achieving efficient architectures along with enhanced accuracy is of paramount importance. Since accuracy and model size of CNNs are intrinsically at odds, the challenge is to achieve a reasonable trade-off between the two. To address this, we propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner. On one hand, it uses a bank of convolutional filters with multiple dilation rates, which leads to multi-scale context aggregation and is crucial for achieving better accuracy. On the other hand, parameters are reduced by a careful factorization of the employed filters, which is crucial for achieving lightweight models. Moreover, we decompose the spatial pyramid into two stages, which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect. We also design a dedicated Feature-Image Reinforcement (FIR) unit to fuse shallow and deep features with downsampled versions of the input image. This improves accuracy without increasing the number of model parameters. Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves a state-of-the-art accuracy-efficiency trade-off. More specifically, with fewer than 0.5 million parameters, the proposed network achieves 66.93% and 66.28% mIoU on the Cityscapes validation and test sets, respectively. Moreover, FPLNet has a processing speed of 95.5 frames per second (FPS).
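
To make the parameter argument concrete, the following is a minimal sketch of how factorizing a bank of dilated filters reduces the weight count while the dilation rates still widen the context each filter sees. The abstract does not specify the exact factorization, so this assumes the common asymmetric form (a k x 1 convolution followed by a 1 x k convolution); the channel width, dilation rates, and all function names are hypothetical and chosen only for illustration, not taken from FPLNet.

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Number of weights in a standard 2-D convolution (bias omitted)."""
    return c_in * c_out * k_h * k_w


def factorized_conv_params(c_in, c_out, k):
    """Weights of a k x 1 convolution followed by a 1 x k convolution.

    Assumption: asymmetric factorization; the paper's exact scheme may differ.
    """
    return conv_params(c_in, c_out, k, 1) + conv_params(c_out, c_out, 1, k)


def effective_kernel(k, dilation):
    """Span along one axis covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


channels = 32             # hypothetical channel width
dilations = [1, 2, 4, 8]  # hypothetical dilation rates of the filter bank

# Dilation changes the sampling pattern, not the number of weights,
# so every branch in the bank costs the same.
full = sum(conv_params(channels, channels, 3, 3) for _ in dilations)
fact = sum(factorized_conv_params(channels, channels, 3) for _ in dilations)
spans = [effective_kernel(3, d) for d in dilations]

print("full 3x3 bank weights:   ", full)
print("factorized bank weights: ", fact)
print("per-axis span per branch:", spans)
```

Under these assumptions the factorized bank needs two thirds of the weights of the full 3 x 3 bank, while each branch still covers the same per-axis span as its unfactorized counterpart.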
