Recent work on spatiotemporal radiance fields can produce photorealistic free-viewpoint videos. However, these methods are inherently unsuitable for interactive streaming scenarios (e.g. video conferencing, telepresence) because they have an inevitable lag even if training is instantaneous. The lag arises because these approaches consume videos and thus have to buffer chunks of frames (often seconds long) before processing. In this work, we take a step towards interactive streaming via a frame-by-frame approach that is naturally free of lag. Conventional wisdom holds that per-frame NeRFs are impractical due to prohibitive training and storage costs. We challenge this belief by introducing Incremental Neural Videos (INV), a per-frame NeRF that is efficiently trained and streamable. INV is designed around two insights: (1) our main finding is that MLPs naturally partition themselves into Structure and Color Layers, which store structural and color/texture information, respectively; (2) we leverage this property to retain and improve upon knowledge from previous frames, thus amortizing training across frames and reducing redundant learning. As a result, with negligible changes to NeRF, INV achieves good quality (>28.6 dB) in 8 min/frame. It also outperforms the prior SOTA with 19% less training time. Additionally, our Temporal Weight Compression reduces the per-frame size to 0.3 MB/frame (6.6% of NeRF). More importantly, INV is free from buffer lag and is naturally suited to streaming. While this work does not achieve real-time training, it shows that incremental approaches like INV open new possibilities in interactive 3D streaming. Moreover, our discovery of this natural information partition leads to a better understanding and manipulation of MLPs. Code and dataset will be released soon.
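The figures quoted above can be put in perspective with some back-of-the-envelope arithmetic. The sketch below is illustrative only and is not part of INV: the function names are hypothetical, and the 30 fps capture rate and 61-frame chunk are assumptions that the abstract does not state. Only the 0.3 MB/frame, 6.6%, and 8 min/frame values come from the text.

```python
def buffer_lag_s(chunk_frames: int, fps: float) -> float:
    """Time the first frame of a chunk waits until the whole chunk is captured.

    A per-frame method corresponds to chunk_frames == 1, which gives zero wait.
    """
    if chunk_frames < 1 or fps <= 0:
        raise ValueError("chunk_frames must be >= 1 and fps must be > 0")
    return (chunk_frames - 1) / fps


def implied_baseline_mb(compressed_mb: float, fraction_of_baseline: float) -> float:
    """Baseline model size implied by a compressed size and its stated fraction."""
    return compressed_mb / fraction_of_baseline


def stream_bandwidth_mb_per_s(frame_size_mb: float, fps: float) -> float:
    """Bandwidth needed if one compressed per-frame model is sent per frame."""
    return frame_size_mb * fps


def realtime_training_gap(train_min_per_frame: float, fps: float) -> float:
    """Factor by which training would need to speed up to keep pace with capture."""
    train_s = train_min_per_frame * 60.0
    frame_interval_s = 1.0 / fps
    return train_s / frame_interval_s


if __name__ == "__main__":
    ASSUMED_FPS = 30.0        # assumption: not given in the abstract
    ASSUMED_CHUNK = 61        # assumption: a chunk spanning two seconds at 30 fps

    print("chunk-based lag (s):", buffer_lag_s(ASSUMED_CHUNK, ASSUMED_FPS))
    print("per-frame lag (s):", buffer_lag_s(1, ASSUMED_FPS))
    print("implied NeRF size (MB):", round(implied_baseline_mb(0.3, 0.066), 2))
    print("stream bandwidth (MB/s):", round(stream_bandwidth_mb_per_s(0.3, ASSUMED_FPS), 2))
    print("speed-up needed for real time:", round(realtime_training_gap(8.0, ASSUMED_FPS)))
```

Under these assumptions, the stated 6.6% ratio implies an uncompressed per-frame NeRF of about 4.5 MB. The gap between 8 min/frame and a 30 fps frame interval is about four orders of magnitude, consistent with the abstract's remark that real-time training has not yet been achieved.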