Story continuation is the task of generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting is exploiting prior visual context effectively while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs CLIP to retrieve, from the previous frames, the image most semantically aligned with the current text. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to the early stages of the diffusion process. This enables the model to exploit visual context when it is beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset with large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
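To make the retrieve-then-gate idea concrete, the sketch below shows one plausible implementation of the two components described above: CLIP-based retrieval of the most text-aligned previous frame, and gating of the visual condition by diffusion timestep when no frame is sufficiently relevant. It is a minimal illustration, not the paper's implementation: the CLIP checkpoint, the similarity threshold `tau`, the cutoff fraction `t_cut`, and the helper names are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch of adaptive visual conditioning.
# Checkpoint choice and hyperparameters (tau, t_cut) are assumptions,
# not values reported in the paper.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_condition(caption, prev_images, tau=0.25):
    """Return the previous frame most aligned with the current caption,
    plus a flag indicating whether it clears the relevance threshold."""
    inputs = processor(text=[caption], images=prev_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    text_emb = F.normalize(out.text_embeds, dim=-1)   # (1, d)
    img_embs = F.normalize(out.image_embeds, dim=-1)  # (N, d)
    sims = (img_embs @ text_emb.T).squeeze(-1)        # cosine sims, (N,)
    best = int(sims.argmax())
    return prev_images[best], bool(sims[best] >= tau)

def visual_weight(t, T, relevant, t_cut=0.3):
    """Gate the visual condition at sampling step t (t runs from T, pure
    noise, down to 0). A relevant frame conditions every step; otherwise
    visual guidance is confined to the early high-noise phase, where only
    coarse layout is decided, so the text alone shapes fine content."""
    if relevant:
        return 1.0
    return 1.0 if t > (1.0 - t_cut) * T else 0.0
```

Under these assumptions, the sampling loop would scale the visual-conditioning signal by `visual_weight(t, T, relevant)` at each denoising step, so an irrelevant retrieved frame can still seed global composition early on without overriding the current text later.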