Event cameras provide a promising sensing modality for high-speed and high-dynamic-range vision by asynchronously capturing brightness changes. A fundamental task in event-based vision is event-to-video (E2V) reconstruction, which aims to recover intensity videos from event streams. Most existing E2V approaches formulate reconstruction as a temporal--spatial signal recovery problem, relying on temporal aggregation and spatial feature learning to infer intensity frames. While effective to some extent, this formulation overlooks a critical limitation of event data: due to the change-driven sensing mechanism, event streams are inherently semantically under-determined, lacking the object-level structure and contextual information that are essential for faithful reconstruction. In this work, we revisit E2V from a semantic perspective and argue that effective reconstruction requires going beyond temporal and spatial modeling to explicitly account for missing semantic information. Based on this insight, we propose \textit{Semantic-E2VID}, a semantically enriched, end-to-end E2V framework that reformulates reconstruction as a process of semantic learning, fusion, and decoding. Our approach first performs semantic abstraction by bridging event representations with semantics extracted from a pretrained Segment Anything Model (SAM), while avoiding modality-induced feature drift. The learned semantics are then fused into the event latent space in a representation-compatible manner, enabling event features to capture object-level structure and contextual cues. Furthermore, semantic-aware supervision is introduced to explicitly guide the reconstruction process toward semantically meaningful regions, complementing conventional pixel-level and temporal objectives. Extensive experiments on six public benchmarks demonstrate that Semantic-E2VID consistently outperforms state-of-the-art E2V methods.
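To make the three stages named above (semantic learning, fusion, decoding) concrete, the PyTorch-style sketch below shows one possible arrangement of a semantic branch alongside a standard E2V encoder--decoder. It is a minimal sketch, not the Semantic-E2VID implementation: the module names (\texttt{EventEncoder}, \texttt{FrameDecoder}, \texttt{SemanticFusionE2V}), the voxel-grid input, the 1$\times$1-convolution fusion, and the use of frozen SAM image-encoder features as an alignment target during training are all assumptions made here for illustration.

\begin{verbatim}
# A minimal sketch (not the authors' implementation) of the semantic-fusion idea,
# assuming: (i) events are binned into voxel grids, (ii) a frozen SAM image encoder
# supplies target semantic features during training, and (iii) a lightweight
# projection aligns event features with those semantics before decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventEncoder(nn.Module):
    """Toy convolutional encoder over an event voxel grid (B, bins, H, W)."""
    def __init__(self, bins=5, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bins, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, voxels):
        return self.net(voxels)                      # (B, dim, H/8, W/8)

class FrameDecoder(nn.Module):
    """Toy decoder mapping fused features back to an intensity frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, feats):
        return self.net(feats)

class SemanticFusionE2V(nn.Module):
    """Predict semantics from events, fuse them into the event latent space, decode."""
    def __init__(self, bins=5, dim=256, sam_dim=256):
        super().__init__()
        self.event_encoder = EventEncoder(bins, dim)
        self.proj = nn.Conv2d(dim, sam_dim, 1)        # event features -> semantic space
        self.fuse = nn.Conv2d(dim + sam_dim, dim, 1)  # representation-compatible fusion
        self.decoder = FrameDecoder(dim)

    def forward(self, voxels, sam_feats=None):
        ev = self.event_encoder(voxels)
        sem = self.proj(ev)                           # semantics predicted from events
        align_loss = torch.tensor(0.0, device=voxels.device)
        if sam_feats is not None:                     # training: align with frozen SAM features
            sam_feats = F.interpolate(sam_feats, size=sem.shape[-2:],
                                      mode="bilinear", align_corners=False)
            align_loss = F.mse_loss(sem, sam_feats)
        fused = self.fuse(torch.cat([ev, sem], dim=1))
        frame = self.decoder(fused)
        return frame, align_loss
\end{verbatim}

A forward pass on a \texttt{(B, 5, 128, 128)} voxel tensor returns a reconstructed frame and an auxiliary alignment loss; in the actual framework the semantic abstraction, fusion mechanism, and semantic-aware supervision are more elaborate than these stand-ins, and the sketch only indicates where the semantic branch and its alignment objective would sit relative to a conventional E2V pipeline.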