The two-branch network architecture has shown its efficiency and effectiveness in real-time semantic segmentation tasks. However, directly fusing high-resolution details with low-frequency context has a drawback: detailed features are easily overwhelmed by the surrounding contextual information. This overshoot phenomenon limits the segmentation accuracy that existing two-branch models can reach. In this paper, we make a connection between Convolutional Neural Networks (CNNs) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture, PIDNet, which contains three branches to parse detailed, context and boundary information, respectively, and employs boundary attention to guide the fusion of the detailed and context branches. Our family of PIDNets achieves the best trade-off between inference speed and accuracy, and its accuracy surpasses that of all existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6% mIOU at an inference speed of 93.2 FPS on Cityscapes and 80.1% mIOU at 153.7 FPS on CamVid.
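The idea of letting boundary attention guide the fusion of the detailed and context branches can be illustrated with a minimal sketch. The abstract does not give the fusion rule, so the per-pixel gate used here (a sigmoid of a boundary score that weights the detail feature against the context feature) is an assumption made for illustration, and the names `boundary_guided_fusion`, `detail`, `context` and `boundary_logits` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def boundary_guided_fusion(detail, context, boundary_logits):
    """Blend per-pixel detail and context features with a boundary gate.

    Assumed rule (illustrative only, not stated in the abstract):
        out = g * detail + (1 - g) * context,  g = sigmoid(boundary_logit)
    A high boundary score favours the detail feature; a low score
    favours the context feature.
    """
    if not (len(detail) == len(context) == len(boundary_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for d, c, b in zip(detail, context, boundary_logits):
        g = sigmoid(b)  # gate in (0, 1) derived from the boundary score
        fused.append(g * d + (1.0 - g) * c)
    return fused


if __name__ == "__main__":
    # Three "pixels": a strong boundary, an ambiguous one, a clear interior.
    print(boundary_guided_fusion([1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [50.0, 0.0, -50.0]))
```

Under this assumed rule, pixels with a confident boundary response keep their detailed feature, while interior pixels fall back to context, which is one way to keep details from being overwhelmed during fusion.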