The strong text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method conditions the generation of the current frame on the combined features of historical captions, historical frames, and the current caption. However, this method treats every historical frame and caption as contributing equally: it joins them in order with equal weights, ignoring that not all historical conditions are relevant to generating the current frame. To address this issue, we propose Causal-Story, a model that incorporates a local causal attention mechanism to capture the causal relationship between previous captions, previous frames, and the current caption. By weighting the historical conditions according to this relationship when generating the current frame, Causal-Story improves the global consistency of story generation. We evaluate our model on the PororoSV and FlintstonesSV datasets and obtain state-of-the-art FID scores; the generated frames also demonstrate better visual storytelling.
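To make the contrast between equal weighting and relationship-based weighting concrete, the following is a minimal sketch and not the Causal-Story implementation. It assumes the current caption feature acts as an attention query over the feature vectors of earlier captions and frames, that "local" means only the most recent `window` entries are visible, and that relevance is scored by a scaled dot product. The function names, the `window` parameter, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def local_causal_weights(query, history, window):
    """Softmax attention weights of `query` over `history`.

    Assumption (not from the paper): `history` holds only conditions that
    precede the current frame, ordered oldest to newest, and only the last
    `window` entries may receive non-zero weight.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not history:
        return []
    n = len(history)
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(history):
        if i >= n - window:
            scores.append(dot(query, key) / scale)
        else:
            scores.append(-math.inf)  # outside the local window: masked out
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_condition(query, history, window):
    """Combine historical features using the attention weights above."""
    weights = local_causal_weights(query, history, window)
    return [
        sum(w * h[d] for w, h in zip(weights, history))
        for d in range(len(query))
    ]


# Toy example: three historical feature vectors and one current-caption query.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
weights = local_causal_weights(query, history, window=2)
condition = weighted_condition(query, history, window=2)
```

With `window=2`, the oldest entry receives zero weight and the remaining weight is split according to similarity with the query, whereas an equal-weight baseline would assign 1/3 to each entry regardless of relevance.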