Relit-LiVE: Relight Video by Jointly Learning Environment Video

Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while relaxing the requirement for known per-frame camera poses. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The project is available at https://github.com/zhuxing0/Relit-LiVE.
