Representing human performance at high fidelity is an essential building block in diverse applications, such as film production, computer games, or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input and enables playback from novel, unseen viewpoints. Our representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences while representing high-resolution details, even under challenging motion. While most research focuses on synthesis at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences, with high-fidelity per-frame mesh reconstructions. We demonstrate the challenges that emerge from using such high-resolution data and show that HumanRF effectively leverages this data, making a significant step towards production-quality novel view synthesis.