Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
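The shot decomposition and three-stage strategy described above can be sketched as a small data structure and a composition loop. This is only an illustrative outline, not the authors' implementation: the names `Shot`, `build_shot`, `build_video`, and the three stage callables are hypothetical, the string-based stubs merely stand in for real generative models, and the choice to treat the reference keyframe as the first keyframe of each shot is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in types: a real system would use images and video tensors.
Keyframe = str
Clip = str


@dataclass(frozen=True)
class Shot:
    """One independently controllable shot (hypothetical container)."""
    sketch: str             # black-and-white sketch: pose and layout
    appearance_prompt: str  # identity, scene, and style
    motion_prompt: str      # temporal dynamics


def build_shot(
    shot: Shot,
    make_reference: Callable[[str, str], Keyframe],
    expand_motion: Callable[[Keyframe, str], List[Keyframe]],
    synthesize_clip: Callable[[Keyframe, Keyframe], Clip],
) -> List[Clip]:
    # Stage 1: structure-aligned reference keyframe from sketch + appearance.
    reference = make_reference(shot.sketch, shot.appearance_prompt)
    # Stage 2: derivative keyframes, one per action state in the motion prompt.
    derived = expand_motion(reference, shot.motion_prompt)
    # Assumption: the reference keyframe opens the keyframe sequence.
    keyframes = [reference, *derived]
    # Stage 3: one clip between each pair of adjacent keyframes.
    return [synthesize_clip(a, b) for a, b in zip(keyframes, keyframes[1:])]


def build_video(shots, make_reference, expand_motion, synthesize_clip) -> List[Clip]:
    # "Global multi-shot": each shot is built independently, then concatenated.
    clips: List[Clip] = []
    for shot in shots:
        clips.extend(build_shot(shot, make_reference, expand_motion, synthesize_clip))
    return clips


# Toy stubs so the sketch runs end to end without any model.
def stub_reference(sketch: str, appearance: str) -> Keyframe:
    return f"ref({sketch})"


def stub_expand(reference: Keyframe, motion: str) -> List[Keyframe]:
    return [f"{reference}+{step.strip()}" for step in motion.split(",")]


def stub_clip(a: Keyframe, b: Keyframe) -> Clip:
    return f"[{a} -> {b}]"


storyboard = [
    Shot("s1.png", "a girl in a red coat, snowy street", "raise arm, wave"),
    Shot("s2.png", "same girl, close-up", "smile"),
]
video = build_video(storyboard, stub_reference, stub_expand, stub_clip)
```

With the stubs above, a shot whose motion prompt lists n action states yields n clips, since the reference keyframe plus n derivative keyframes form n adjacent pairs. Passing the stages in as callables keeps the per-shot control inputs separate from whichever image or video model is plugged in.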