Recent advances in image and video creation, especially AI-based image synthesis, have produced a large number of visual scenes with a high degree of abstraction and diversity. As a result, Visual Storytelling (VST), the task of generating a meaningful and coherent narrative from a collection of images, has become more challenging and is increasingly in demand beyond real-world imagery. Existing VST techniques, which typically rely on autoregressive decoders, have made significant progress, but they suffer from slow inference and are not well suited to synthetic scenes. To address these limitations, we propose DiffuVST, a novel diffusion-based system that models the generation of a series of visual descriptions as a single conditional denoising process. Because DiffuVST is stochastic and non-autoregressive at inference time, it generates highly diverse narratives more efficiently. In addition, DiffuVST features bi-directional text history guidance and multimodal adapter modules, which improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on story generation across four fictional visual-story datasets demonstrate that DiffuVST outperforms traditional autoregressive models in both text quality and inference speed.