We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects, such as lighting changes or character transformations, that are difficult to describe via text or static conditions. Transferring a video effect is challenging because the model must integrate the new temporal dynamics with the input video's existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, each consisting of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially for the video-to-video effect triplets, which do not occur naturally. To generate them, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it according to a fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and with code-based temporal effects generated through programmatic composition. Building on our new dataset, we train a reference-conditioned model on top of recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes to unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://snap-research.github.io/RefVFX/