Domain Generalized Video Semantic Segmentation (DGVSS) trains a model on a single labeled driving domain and deploys it directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes prediction discrepancies across different temporal strides, and we randomize the training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General substantially improves cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
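The stride-randomization and masked consistency regularization described above could be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the L1 discrepancy, and the confidence-based mask for label-stable regions are all assumptions.

```python
import numpy as np

def sample_training_strides(rng, max_stride=5, n=2):
    """Randomize temporal strides so the model sees diverse frame gaps.
    (Illustrative helper; the actual stride range is an assumption.)"""
    return rng.integers(1, max_stride + 1, size=n).tolist()

def masked_temporal_consistency_loss(p_a, p_b, conf_thresh=0.8):
    """Penalize the discrepancy between per-pixel class probabilities
    predicted for the same frame under two different temporal strides,
    masked to confident (label-stable) pixels.

    p_a, p_b: arrays of shape (H, W, num_classes) with per-pixel
    class probabilities. Masking by prediction confidence is one
    plausible choice of mask, assumed here for illustration.
    """
    # Confidence of each prediction = max class probability;
    # a pixel counts as label-stable only if both passes are confident.
    conf = np.minimum(p_a.max(axis=-1), p_b.max(axis=-1))
    mask = conf > conf_thresh
    # Per-pixel L1 discrepancy between the two stride-conditioned predictions.
    diff = np.abs(p_a - p_b).sum(axis=-1)
    if mask.sum() == 0:
        return 0.0
    return float((diff * mask).sum() / mask.sum())
```

Identical predictions yield a loss of zero regardless of stride, so the regularizer only penalizes stride-dependent flicker; masking keeps the penalty away from genuinely ambiguous regions where disagreement is expected.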