Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network

Real-time semantic segmentation is a crucial research topic for real-world applications. However, many methods emphasize reducing computational complexity and model size at a substantial cost in accuracy. To tackle this problem, we propose a parallel inference network customized for semantic segmentation that achieves a good trade-off between speed and accuracy. We employ a shallow backbone to ensure real-time speed and propose three core components that compensate for the reduced model capacity and improve accuracy. Specifically, we first design a dual-pyramidal path architecture, the Multi-level Feature Aggregation Module (MFAM), to aggregate multi-level features from the encoder at each scale, providing hierarchical clues for the subsequent spatial alignment and the corresponding in-network inference. Then, we build the Recursive Alignment Module (RAM) by combining a flow-based alignment module with a recursive upsampling architecture, which yields accurate spatial alignment between multi-scale feature maps at half the computational complexity of the straightforward alignment method. Finally, we perform independent parallel inference on the aligned features to obtain multi-scale scores and adaptively fuse them through an attention-based Adaptive Scores Fusion Module (ASFM), so that the final prediction can favor objects of multiple scales. Our framework shows a better balance between speed and accuracy than state-of-the-art real-time methods on the Cityscapes and CamVid datasets. We also conduct systematic ablation studies to gain insight into our motivation and architectural design. Code is available at: https://github.com/Yanhua-Zhang/MFARANet.
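
To make the final fusion step concrete, the following is a minimal sketch of attention-weighted fusion of multi-scale scores. It is not the ASFM implementation from the repository: the abstract does not say how the attention is computed, so the sketch assumes a per-pixel softmax over scales, and the names `fuse_scores` and `attention_logits` are hypothetical. It also assumes the per-scale scores have already been aligned to a common resolution.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(scores, attention_logits):
    """Fuse per-scale class scores at every pixel (illustrative only).

    scores[s][p][c]        -- score of class c at pixel p predicted from scale s
    attention_logits[s][p] -- unnormalized attention for scale s at pixel p
                              (assumed input; a network would predict these)
    Returns fused[p][c], a weighted sum over scales with weights that
    sum to 1 at each pixel.
    """
    num_scales = len(scores)
    num_pixels = len(scores[0])
    num_classes = len(scores[0][0])
    fused = []
    for p in range(num_pixels):
        # Assumption: attention is normalized across scales per pixel.
        weights = softmax([attention_logits[s][p] for s in range(num_scales)])
        fused.append([
            sum(weights[s] * scores[s][p][c] for s in range(num_scales))
            for c in range(num_classes)
        ])
    return fused


# Toy input: two scales, two pixels, two classes.
scores = [
    [[2.0, 0.0], [4.0, 0.0]],  # scale 0
    [[0.0, 2.0], [0.0, 4.0]],  # scale 1
]
attention_logits = [
    [0.0, 0.0],   # scale 0: equal weight at pixel 0, low weight at pixel 1
    [0.0, 5.0],   # scale 1: equal weight at pixel 0, dominant at pixel 1
]
fused = fuse_scores(scores, attention_logits)
```

In this toy input the two scales disagree at both pixels. At pixel 0 the equal logits average the two predictions, while at pixel 1 the larger logit lets scale 1 dominate, which is the behaviour an adaptive fusion is meant to provide for objects of different sizes.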
