Scene understanding with free-form language has been widely explored across diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors remain scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework, which addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, our model provides a unified framework that supports both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including the instance level and the part level. To enable thorough evaluation of OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that SEAL substantially outperforms the proposed baselines in both performance and inference speed while using a parameter-efficient architecture. In the Appendix, we further present a simple variant of SEAL that achieves generic spatiotemporal OV-EIS without requiring any visual prompts from users at inference time. Check out our project page at https://0nandon.github.io/SEAL