Neural rendering is fuelling a unification of learning, 3D geometry and video understanding that has been awaited for more than two decades. Progress, however, is still hampered by a lack of suitable datasets and benchmarks. To address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. We illustrate the challenges of photogrammetry in egocentric videos of dynamic actions and propose innovations to address them. Compared to other neural rendering datasets, EPIC Fields is better tailored to video understanding because it is paired with labelled action segments and the recent VISOR segment annotations. To further motivate the community, we also evaluate two benchmark tasks, neural rendering and segmenting dynamic objects, with strong baselines that showcase what is not possible today. We also highlight the advantage of geometry in semi-supervised video object segmentation on the VISOR annotations. EPIC Fields reconstructs 96% of the videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.