Storyboard synthesis plays a central role in visual storytelling: it aims to generate coherent shot sequences that narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a storyboard framework built on a video generative model that exploits video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatio-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, which explicitly constrains the attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable, video-model-driven visual storytelling.