LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, such as text-conditioned image editing, which lets us edit diverse images conveying highly complex visual concepts according to textual guidance. Despite their promise, existing methods focus on texture- or non-rigid-based visual manipulation and, because of their highly unstructured latent space, struggle to produce fine-grained animations of smooth text-conditioned image morphing without fine-tuning. In this paper, we introduce a tuning-free LLM-driven attention control framework built on a progressive process of LLM planning, prompt-Aware editing, and StablE animation geneRation, abbreviated as LASER. LASER employs a large language model (LLM) to refine coarse descriptions into detailed prompts, which guide pre-trained text-to-image models in subsequent image generation. We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity and structural consistency across images, enabling seamless morphing directly from text prompts without additional fine-tuning or annotations. The resulting framework integrates LLMs with text-to-image models to create high-quality animations from a single text input. We also propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness of LASER. Extensive experiments demonstrate that LASER produces impressive, consistent, and efficient results in animation generation, positioning it as a powerful tool for advanced digital content creation.
