We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.
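For readers unfamiliar with the PSNR metric named above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is not the authors' evaluation code: the function name `psnr`, the flat pixel-sequence input, and the default `max_value` of 255 are illustrative assumptions, and the abstract does not state the data range or color space used in the reported results.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences.

    Assumption: pixels are given as flat sequences of numbers in [0, max_value].
    """
    if len(reference) != len(edited) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [100, 100, 100, 100]
    prediction = [110, 90, 100, 100]
    print(f"PSNR: {psnr(ground_truth, prediction):.2f} dB")
```

Higher values indicate a prediction closer to the reference at the pixel level. PSNR does not capture semantic similarity, which is presumably why the abstract also reports DINO and CLIP consistency.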