User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate research on the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture across diverse indoor and outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent from existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are recorded at 5K resolution and 60 FPS, last 1-5 minutes each, and feature rich foreground-background elements and complex dynamics. We benchmark existing methods on our dataset and establish a baseline pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs, enabling 6-DoF multi-modal immersive VR experiences. The benchmark results, together with the reconstruction and interaction results, demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.