Story continuation is the task of generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting is exploiting prior visual context effectively while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs CLIP to retrieve, from the previous frames, the image most semantically aligned with the current text. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to the early stages of the diffusion process. This enables the model to exploit visual context when it is beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset with large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.
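To make the retrieve-then-gate idea concrete, the sketch below shows one plausible implementation of the two components described above: CLIP-based retrieval of the most text-aligned previous frame, and gating of the visual condition by diffusion timestep when no frame is sufficiently relevant. It is a minimal illustration, not the paper's implementation: the CLIP checkpoint, the similarity threshold `tau`, the cutoff fraction `t_cut`, and the helper names are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch of adaptive visual conditioning.
# Checkpoint choice and hyperparameters (tau, t_cut) are assumptions,
# not values reported in the paper.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_condition(caption, prev_images, tau=0.25):
    """Return the previous frame most aligned with the current caption,
    plus a flag indicating whether it clears the relevance threshold."""
    inputs = processor(text=[caption], images=prev_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    text_emb = F.normalize(out.text_embeds, dim=-1)   # (1, d)
    img_embs = F.normalize(out.image_embeds, dim=-1)  # (N, d)
    sims = (img_embs @ text_emb.T).squeeze(-1)        # cosine sims, (N,)
    best = int(sims.argmax())
    return prev_images[best], bool(sims[best] >= tau)

def visual_weight(t, T, relevant, t_cut=0.3):
    """Gate the visual condition at sampling step t (t runs from T, pure
    noise, down to 0). A relevant frame conditions every step; otherwise
    visual guidance is confined to the early high-noise phase, where only
    coarse layout is decided, so the text alone shapes fine content."""
    if relevant:
        return 1.0
    return 1.0 if t > (1.0 - t_cut) * T else 0.0
```

Under these assumptions, the sampling loop would scale the visual-conditioning signal by `visual_weight(t, T, relevant)` at each denoising step, so an irrelevant retrieved frame can still seed global composition early on without overriding the current text later.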