Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
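
To give a sense of the temporal scale implied by the stated capture settings (60 FPS, clips of 1-5 minutes), the per-camera frame budget can be worked out directly. The following is a minimal sketch, not part of the described pipeline: the function names are hypothetical, and `num_cameras` is an illustrative input because the abstract does not state how many cameras the rig carries.

```python
# Frame budget implied by the capture settings stated in the abstract.
FPS = 60  # frame rate given in the abstract


def frames_per_camera(duration_minutes: float, fps: int = FPS) -> int:
    """Number of frames one camera records for a clip of the given length."""
    return round(duration_minutes * 60 * fps)


def total_frames(duration_minutes: float, num_cameras: int, fps: int = FPS) -> int:
    """Frames across the whole rig.

    num_cameras is a hypothetical parameter: the abstract does not give
    the camera count, so the caller must supply it.
    """
    return frames_per_camera(duration_minutes, fps) * num_cameras


if __name__ == "__main__":
    # Shortest and longest clip durations mentioned in the abstract.
    for minutes in (1, 5):
        print(f"{minutes} min -> {frames_per_camera(minutes)} frames per camera")
```

Under these settings a single camera yields between 3,600 and 18,000 frames per clip, which is the sequence length a reconstruction method would need to remain temporally stable over.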
