Autonomous systems, such as self-driving cars, depend on reliable semantic perception of their environment for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and the ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate for the motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled, interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
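The prediction-fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the MCDS-VSS implementation: the function names (`ego_flow`, `warp`, `predict_and_fuse`), the pinhole camera model, the nearest-neighbour sampling, and the fixed blending weight `alpha` are all assumptions made for illustration, standing in for components that the model learns.

```python
import numpy as np

def ego_flow(depth, K, T_cur_to_prev):
    """Flow induced by camera motion alone (assumes a pinhole camera).

    For each pixel of the current frame, returns the offset (du, dv) to the
    location where the same static 3D point appears in the previous frame.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project to 3D
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])    # homogeneous coordinates
    proj = K @ (T_cur_to_prev @ pts_h)[:3]                  # move to previous camera, project
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return (uv - pix[:2]).reshape(2, H, W)

def warp(feat, flow):
    """Backward-warp a (C, H, W) feature map with nearest-neighbour sampling."""
    _, H, W = feat.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_u = np.rint(u + flow[0]).astype(int)
    src_v = np.rint(v + flow[1]).astype(int)
    valid = (src_u >= 0) & (src_u < W) & (src_v >= 0) & (src_v < H)
    src_u = np.clip(src_u, 0, W - 1)
    src_v = np.clip(src_v, 0, H - 1)
    return feat[:, src_v, src_u] * valid, valid

def predict_and_fuse(feat_prev, feat_cur, depth, K, T_cur_to_prev,
                     residual_flow, alpha=0.5):
    """Two-stage motion compensation followed by a simple fusion step."""
    # Stage 1: compensate for ego-motion using scene geometry and camera pose.
    ego_pred, ego_valid = warp(feat_prev, ego_flow(depth, K, T_cur_to_prev))
    # Stage 2: compensate for dynamic objects using the residual flow.
    pred, res_valid = warp(ego_pred, residual_flow)
    ego_valid_warped, _ = warp(ego_valid[None].astype(np.float64), residual_flow)
    mask = res_valid & (ego_valid_warped[0] > 0.5)
    # Fusion: fixed-weight blend (an assumption); fall back to the current
    # features wherever the prediction is undefined.
    w = alpha * mask
    return w * pred + (1.0 - w) * feat_cur
```

In this sketch the depth map, camera pose, and residual flow are plain inputs; in the approach described above they would be estimated by the model itself, and the blending weight would be replaced by a learned fusion.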