Moving instance segmentation (MIS) has attracted increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes with high temporal resolution and high dynamic range, which makes them highly sensitive to motion. Fusing event and image features lets motion cues from events complement spatial details from images, improving MIS performance. However, current multimodal MIS methods still struggle to segment small moving instances, because event cameras often yield sparse features at their limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both the image and event modalities, thereby improving feature density. We then introduce a multi-granularity cross-modal alignment that aligns distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low light.