Stereo event-intensity camera setups are widely used to combine the low latency of event cameras with the accurate brightness and texture information captured by intensity cameras. However, such a setup commonly suffers from cross-modality parallax that is difficult to eliminate with stereo rectification alone, especially in real-world scenes with complex motions and varying depths, which causes artifacts and distortion in existing Event-based Video Frame Interpolation (E-VFI) approaches. To tackle this problem, we propose a novel Stereo Event-based VFI (SE-VFI) network (SEVFI-Net) that generates high-quality intermediate frames and corresponding disparities from misaligned inputs consisting of two consecutive keyframes and the event streams emitted between them. Specifically, we propose a Feature Aggregation Module (FAM) to alleviate the parallax and achieve spatial alignment in the feature domain. We then exploit the fused features to estimate optical flow and disparity accurately, and to obtain better interpolated results through both flow-based and synthesis-based approaches. We also build a stereo visual acquisition system composed of an event camera and an RGB-D camera, and use it to collect a new Stereo Event-Intensity Dataset (SEID) containing diverse scenes with complex motions and varying depths. Experiments on the public real-world stereo datasets DSEC and MVSEC, and on our SEID dataset, demonstrate that the proposed SEVFI-Net outperforms state-of-the-art methods by a large margin.