Recent advances in semantic segmentation rely heavily on attention-based and transformer-style architectures that, while accurate, add considerable design complexity and computational cost. This paper asks whether a compact CNN-based segmentation head can remain competitive by adaptively selecting useful receptive-field evidence. We propose ATV-Net, an Adaptive Triple-View Network that attaches a lightweight head to a conventional backbone. The head organizes three complementary views (point-wise, neighborhood-level, and enlarged-context) and fuses them through an Adaptive Decision Gate that generates image-dependent weights from global feature statistics. This allows the model to emphasize different receptive-field responses according to scene content, without dense attention or multi-scale aggregation. On Cityscapes, ATV-Net reaches 80.31% mIoU with ResNet-101 and 80.90% with ConvNeXt-Tiny; on Pascal VOC 2012, it reaches 86.7% and 88.5% mIoU with the same two backbones, respectively. In both cases it requires fewer GFLOPs than representative context-aggregation and attention-based heads. The results indicate that adaptive receptive-field selection remains a practical and effective design choice for CNN-based semantic segmentation.
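To make the gating idea concrete, the following is a minimal sketch of how image-dependent weights could be derived from global feature statistics and used to fuse three views. It is not the authors' implementation. The abstract does not specify how the views are computed or how the gate is parameterized, so several choices here are assumptions: the global statistic is a single mean per view, the gate is a linear projection followed by a softmax, and the names `global_mean`, `gate_weights`, `fuse`, `proj`, and `bias` are hypothetical. Views are plain nested lists indexed as `[channel][row][col]` to keep the example dependency-free.

```python
import math


def global_mean(view):
    """Global average of one view stored as nested lists [channel][row][col]."""
    values = [v for channel in view for row in channel for v in row]
    return sum(values) / len(values)


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(views, proj, bias):
    """Map per-view global statistics to one fusion weight per view.

    proj (one row per view) and bias stand in for learned parameters;
    both are hypothetical and not taken from the paper.
    """
    stats = [global_mean(view) for view in views]
    logits = [sum(p * s for p, s in zip(row, stats)) + b
              for row, b in zip(proj, bias)]
    return softmax(logits)


def fuse(views, weights):
    """Element-wise weighted sum of equally shaped views."""
    channels = len(views[0])
    height = len(views[0][0])
    width = len(views[0][0][0])
    return [[[sum(w * view[c][y][x] for w, view in zip(weights, views))
              for x in range(width)]
             for y in range(height)]
            for c in range(channels)]
```

Calling `gate_weights` on the point-wise, neighborhood-level, and enlarged-context feature maps of one image yields three weights that sum to one, and `fuse` then blends the views with those weights. Because the weights depend on the statistics of the input, different images can emphasize different receptive fields. A practical head would more likely use per-channel statistics and tensor operations, which this sketch leaves out for brevity.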