Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, a style image, and a stylized first frame. Each condition has a characteristic advantage: text is the most flexible, a style image provides a more accurate visual anchor, and a stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, the lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization that supports (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline for acquiring high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained with a Low-Rank Adaptation (LoRA) whose token-specific up matrices reduce confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle handles all three video stylization tasks and outperforms competing methods in style consistency and video quality.
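The abstract does not spell out how the token-specific up matrices are wired. Below is a minimal PyTorch sketch of one plausible reading: a shared LoRA down-projection with a separate up matrix per condition-token type (text, style image, first frame), routed by a token-type index. All names (`TokenSpecificLoRALinear`, `token_type`, the rank and the number of token types) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TokenSpecificLoRALinear(nn.Module):
    """Hypothetical LoRA layer: one shared down-projection A and
    per-condition-type up matrices B_k, so text / style-image /
    first-frame tokens each get their own low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, num_token_types: int = 3):
        super().__init__()
        self.base = base
        # Shared down-projection A (standard LoRA component).
        self.down = nn.Linear(base.in_features, rank, bias=False)
        # One up matrix B_k per condition-token type.
        self.ups = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False)
            for _ in range(num_token_types)
        )
        # Standard LoRA init: B = 0, so training starts at the base model.
        for up in self.ups:
            nn.init.zeros_(up.weight)

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); token_type: (batch, seq) ints in [0, num_token_types).
        h = self.down(x)
        delta = torch.zeros(
            *x.shape[:-1], self.base.out_features, device=x.device, dtype=x.dtype
        )
        # Route each token through the up matrix of its own condition type,
        # keeping the low-rank updates for different conditions separate.
        for k, up in enumerate(self.ups):
            mask = (token_type == k).unsqueeze(-1)
            delta = delta + mask * up(h)
        return self.base(x) + delta
```

Under this reading, keeping the up matrices disjoint per condition type is what prevents gradient interference ("confusion") between the three kinds of condition tokens while still sharing the down-projection's capacity.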