Event cameras offer advantages such as low latency, high temporal resolution, and high dynamic range (HDR) compared to standard cameras. Due to this distinct imaging paradigm, a dominant line of research focuses on event-to-video (E2V) reconstruction to bridge event-based and standard computer vision. However, this task remains challenging because of its inherently ill-posed nature: event cameras only detect edge and motion information locally. Consequently, the reconstructed videos are often plagued by artifacts and regional blur, primarily caused by the ambiguous semantics of event data. In this paper, we find that language naturally conveys abundant semantic information, rendering it remarkably effective in ensuring semantic consistency for E2V reconstruction. Accordingly, we propose a novel framework, called LaSe-E2V, that achieves semantic-aware, high-quality E2V reconstruction from a language-guided perspective, built upon text-conditional diffusion models. However, due to the inherent diversity and randomness of diffusion models, directly applying them while preserving the spatial and temporal consistency required for E2V reconstruction is hardly feasible. Thus, we first propose an Event-guided Spatiotemporal Attention (ESA) module to effectively condition the denoising pipeline on the event data. We then introduce an event-aware mask loss to ensure temporal coherence and a noise initialization strategy to enhance spatial consistency. Given the absence of event-text-video paired data, we aggregate existing E2V datasets and generate textual descriptions using tagging models for training and evaluation. Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method.
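To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of what an event-guided spatiotemporal attention block could look like: per-frame cross-attention from video latents to event features, followed by temporal self-attention across frames. All module names, tensor shapes, and the two-stage attention layout are illustrative assumptions; the paper's actual ESA implementation may differ.

```python
import torch
import torch.nn as nn

class EventGuidedAttention(nn.Module):
    """Hypothetical sketch of an event-guided spatiotemporal attention block.

    Video latent tokens (B, T, N, C) attend to event features (B, T, M, C),
    so each denoising step is conditioned on the event stream. Shapes and
    structure are assumptions for illustration, not the paper's code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor, ev: torch.Tensor) -> torch.Tensor:
        # z: (B, T, N, C) video latents; ev: (B, T, M, C) event features.
        B, T, N, C = z.shape
        # Spatial cross-attention: per frame, latents query event features.
        q = self.norm1(z).flatten(0, 1)   # (B*T, N, C)
        kv = ev.flatten(0, 1)             # (B*T, M, C)
        z = z + self.spatial_attn(q, kv, kv)[0].view(B, T, N, C)
        # Temporal self-attention: per spatial token, attend across frames.
        zt = self.norm2(z).permute(0, 2, 1, 3).flatten(0, 1)  # (B*N, T, C)
        out = self.temporal_attn(zt, zt, zt)[0].view(B, N, T, C).permute(0, 2, 1, 3)
        return z + out
```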
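As illustrative sketches of the other two components, the snippet below shows (i) an event-aware mask loss that up-weights reconstruction error at pixels where events fired, and (ii) a noise initialization that blends a noise map shared across frames with per-frame noise so adjacent frames start from correlated latents. Both are plausible realizations sketched under assumptions, not the paper's exact formulations; all function names and hyperparameters (event_weight, alpha) are hypothetical.

```python
import torch
import torch.nn.functional as F

def event_aware_mask_loss(pred, target, event_voxel,
                          base_weight=1.0, event_weight=2.0):
    """Hypothetical event-aware mask loss (illustrative form only).

    pred, target: (B, T, C, H, W) predicted / ground-truth frames.
    event_voxel:  (B, T, Bins, H, W) event voxel grid aligned with the frames.
    Pixels with event activity receive a larger weight, pushing the model
    to stay temporally coherent exactly where motion occurred.
    """
    # Binary activity mask: any event in any temporal bin at this pixel.
    mask = (event_voxel.abs().sum(dim=2, keepdim=True) > 0).float()  # (B,T,1,H,W)
    weight = base_weight + (event_weight - base_weight) * mask
    return (weight * F.l1_loss(pred, target, reduction="none")).mean()

def consistent_noise_init(B, T, C, H, W, alpha=0.5, device="cpu"):
    """Hypothetical noise initialization for spatial consistency: mix one
    noise map shared by all frames with independent per-frame noise while
    keeping unit variance. The exact mixing scheme is an assumption."""
    shared = torch.randn(B, 1, C, H, W, device=device)     # same for every frame
    per_frame = torch.randn(B, T, C, H, W, device=device)  # frame-specific part
    return alpha ** 0.5 * shared + (1 - alpha) ** 0.5 * per_frame
```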