Lighting plays a pivotal role in ensuring the naturalness of video generation, significantly influencing the aesthetic quality of the generated content. However, because lighting is deeply coupled with the temporal features of videos, it remains challenging to disentangle and model independent, coherent lighting attributes, which limits the ability to control lighting in video generation. In this paper, inspired by established controllable text-to-image (T2I) models, we propose LumiSculpt, which, for the first time, enables precise and consistent lighting control in text-to-video (T2V) generation models. LumiSculpt equips video generation with strong interactive capabilities, allowing the input of custom lighting reference image sequences. Furthermore, the core learnable plug-and-play module of LumiSculpt enables fine-grained control over lighting intensity, position, and trajectory in latent video diffusion models built on the advanced DiT backbone. Additionally, to effectively train LumiSculpt and address the shortage of lighting data, we construct LumiHuman, a new lightweight and flexible dataset for portrait lighting in images and videos. Experimental results demonstrate that LumiSculpt achieves precise and high-quality lighting control in video generation.