Moving object segmentation (MOS) in dynamic scenes is challenging for autonomous driving, especially for sequences captured from a moving ego vehicle. Most state-of-the-art methods leverage motion cues obtained from optical flow maps. However, because this optical flow is typically pre-computed from successive RGB frames, these methods ignore events that occur between frames, which limits their practicality in real-life situations. To address these limitations, we propose to exploit event cameras, which provide rich motion cues without relying on optical flow, for better video understanding. To foster research in this area, we first introduce DSEC-MOS, a novel large-scale dataset for moving object segmentation from moving ego vehicles. We then devise EmoFormer, a novel network able to exploit the event data. To this end, we fuse the event prior with spatial semantic maps to distinguish moving objects from the static background, adding another level of dense supervision around the objects of interest, namely the moving ones. Our proposed network uses event data only during training and does not require event input during inference, making it directly comparable to frame-only methods in terms of efficiency and more widely usable across applications. An exhaustive comparison with 8 state-of-the-art video object segmentation methods shows that our method significantly outperforms all of them. Project Page: https://github.com/ZZY-Zhou/DSEC-MOS.