Story Visualization (SV) is a challenging generative vision task that requires both visual quality and consistency across the frames of a generated image sequence. Previous approaches either employ a memory mechanism to maintain context throughout auto-regressive generation of the image sequence, or model the generation of characters and their background separately to improve character rendering. In contrast, we adopt a fully parallel transformer-based approach that relies exclusively on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique that focuses on the generation of characters implicitly, by combining text-conditional and character-conditional logits in logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates in state-of-the-art (SOTA) results on various metrics in the most prominent SV benchmark (Pororo-SV), attained with constrained resources and with lower computational complexity than prior work. The validity of our quantitative results is supported by a human survey.
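As a rough illustration of the Character Guidance idea described in the abstract, the sketch below combines text-conditional and character-conditional logits in logit space, in the style of classifier-free guidance. The abstract does not give the formula, so the unconditional baseline term, the function name `character_guidance`, and the weights `w_text` and `w_char` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def character_guidance(uncond, text, char, w_text=1.0, w_char=1.0):
    """Hypothetical logit-space combination (an assumption, not the paper's formula).

    uncond: logits predicted with no conditioning
    text:   logits predicted with the caption as condition
    char:   logits predicted with the character labels as condition
    Each conditional branch pushes the result away from the unconditional
    baseline, scaled by its own guidance weight.
    """
    return [
        u + w_text * (t - u) + w_char * (c - u)
        for u, t, c in zip(uncond, text, char)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy logits over a vocabulary of three image tokens.
    uncond = [0.0, 0.0, 0.0]
    text = [1.0, 0.5, -0.5]
    char = [0.2, 1.5, -1.0]
    guided = character_guidance(uncond, text, char, w_text=1.0, w_char=0.5)
    print(softmax(guided))
```

With `w_char=0` this sketch reduces to plain text-conditional guidance; raising `w_char` shifts probability mass toward tokens favored by the character-conditional branch.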