Recent research has shown the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which predominantly generate stories in an autoregressive and excessively caption-dependent manner, often underrate the contextual consistency and relevance of frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a novel two-stage approach designed to enhance the semantic and temporal consistency of story generation. Specifically, in the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, unlike autoregressive models, RCDMs can generate consistent stories with a single forward inference. Our qualitative and quantitative results demonstrate that the proposed RCDMs perform strongly in challenging scenarios. The code and model will be available at https://github.com/muzishen/RCDMs.
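The two-stage data flow described above can be sketched roughly as follows. This is only an illustration of which inputs feed which stage, not the RCDMs implementation: every name here (`Clip`, `Conditions`, `stage1_predict_frame_embeddings`, `stage2_generate`, `generate_story`) is hypothetical, embeddings are plain lists of floats, and both stages are trivial stand-ins for the diffusion models.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Clip:
    """Known clip: one caption embedding and one frame embedding per frame."""
    captions: List[Vector]
    frames: List[Vector]


@dataclass
class Conditions:
    """The three contextual conditions named for the second stage."""
    reference_frames: List[Vector]            # stand-in for reference images of the known clip
    predicted_frame_embeddings: List[Vector]  # stage-1 output for the unknown clip
    caption_embeddings: List[Vector]          # text embeddings of all captions


def stage1_predict_frame_embeddings(known: Clip,
                                    unknown_captions: List[Vector]) -> List[Vector]:
    """Placeholder for the frame-prior model (assumption: NOT a diffusion model).

    It applies the mean caption-to-frame offset observed in the known clip
    to each caption of the unknown clip.
    """
    dim = len(known.captions[0])
    n = len(known.captions)
    offset = [
        sum(f[d] - c[d] for c, f in zip(known.captions, known.frames)) / n
        for d in range(dim)
    ]
    return [[c[d] + offset[d] for d in range(dim)] for c in unknown_captions]


def stage2_generate(cond: Conditions) -> List[Vector]:
    """Placeholder for the second stage.

    A single call returns all unknown frames at once, rather than one frame
    per call as in an autoregressive loop. Here it simply copies the
    predicted embeddings.
    """
    return [list(e) for e in cond.predicted_frame_embeddings]


def generate_story(known: Clip, unknown_captions: List[Vector]) -> List[Vector]:
    """Run stage 1, assemble the conditions, then run stage 2 once."""
    predicted = stage1_predict_frame_embeddings(known, unknown_captions)
    cond = Conditions(
        reference_frames=known.frames,
        predicted_frame_embeddings=predicted,
        caption_embeddings=known.captions + unknown_captions,
    )
    return stage2_generate(cond)


known_clip = Clip(captions=[[0.0, 0.0], [1.0, 1.0]],
                  frames=[[1.0, 2.0], [2.0, 3.0]])
story = generate_story(known_clip, [[2.0, 2.0], [3.0, 0.0]])
```

The point of the sketch is the wiring: stage 1 sees only the known clip and the new captions, while stage 2 receives all three conditions together and is called once for the whole unknown clip.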