Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. Event data are difficult to interpret and annotate, which limits the scalability of ESS. Domain adaptation from images to event data can mitigate this issue, but the two modalities differ in data representation, and bridging that gap requires additional effort. In this work, for the first time, we synergize information from the image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks show that our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic, respectively, without using either event or frame labels.
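
To make the idea of frame-to-event contrastive distillation more concrete, the following is a minimal sketch and not the OpenESS implementation. The abstract does not give the loss, so the InfoNCE-style form, the function names `l2_normalize` and `contrastive_distillation_loss`, and the temperature value are all assumptions. It takes event embeddings and frame embeddings that are paired by index, pulls each matching pair together, and treats the other frame embeddings in the batch as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def contrastive_distillation_loss(event_feats, frame_feats, temperature=0.07):
    """InfoNCE-style loss over paired event/frame embeddings (assumed form).

    event_feats[i] and frame_feats[i] are assumed to describe the same
    region; every other frame embedding in the batch acts as a negative.
    """
    if len(event_feats) != len(frame_feats) or not event_feats:
        raise ValueError("need the same non-zero number of event and frame embeddings")

    events = [l2_normalize(v) for v in event_feats]
    frames = [l2_normalize(v) for v in frame_feats]

    total = 0.0
    for i, ev in enumerate(events):
        # Cosine similarity of this event embedding to every frame embedding,
        # scaled by the temperature.
        logits = [sum(a * b for a, b in zip(ev, fr)) / temperature for fr in frames]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability assigned to the matching frame embedding.
        total += log_denom - logits[i]
    return total / len(events)
```

Under these assumptions the loss is small when each event embedding is most similar to its own frame embedding and grows when the pairing is broken. In practice the frame embeddings would come from a frozen image encoder such as CLIP and the event embeddings from the network being trained, with gradients flowing only through the event branch.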