Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently and thus fail to exploit temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective for both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union (mIoU) compared to single-frame baselines. These results establish STA as an effective architectural enhancement for video-based semantic segmentation applications.
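To make the core idea concrete, the following is a minimal sketch, assuming a PyTorch setting, of one common way to extend self-attention over multi-frame context: tokens from T frames are flattened into a single spatio-temporal sequence so that every query attends jointly across space and time. The module name, tensor shapes, and the last-frame readout below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's exact STA design):
# stack T frames' tokens into one sequence and run standard
# multi-head self-attention over the joint spatio-temporal axis.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Illustrative joint attention over T stacked frames."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- B clips, T frames, N spatial tokens, C channels
        b, t, n, c = x.shape
        seq = x.reshape(b, t * n, c)       # flatten time into the token axis
        out, _ = self.attn(seq, seq, seq)  # joint spatio-temporal self-attention
        # read out features for the current (last) frame only
        return out.reshape(b, t, n, c)[:, -1]

# usage: two-frame context, 16x16 = 256 spatial tokens, 256-dim features
x = torch.randn(2, 2, 256, 256)
sta = SpatioTemporalAttention(dim=256)
print(sta(x).shape)  # torch.Size([2, 256, 256])
```

One design note consistent with the abstract's claims: because the sketch reuses a standard attention block and only reshapes its input, it requires minimal changes to an existing transformer, though the attention cost grows with T, which is why a real implementation would keep the frame window small or restrict the temporal attention pattern.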