LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

Despite rapid progress in video generation, existing models cannot produce vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660K real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we fine-tune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer substantially reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT generalizes well across diverse animation styles and outperforms previous state-of-the-art models on SVG generation, which can be viewed as the special case of a single-frame vector animation.
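
To make the idea of flattening a Lottie file into a token sequence concrete, the following is a minimal illustrative sketch, not the Lottie Tokenizer described above, whose vocabulary and encoding are not specified in this abstract. It assumes a toy document that uses standard Lottie field names (`layers`, `ks`, `shapes`, `a`, `k`, `t`, `s`); the token spellings (`<layer:…>`, `<prop:…>`, `<key:…>`, `<num:…>`) and the helper names `number_tokens`, `property_tokens`, and `tokenize` are hypothetical and chosen only for illustration.

```python
# Toy Lottie document. Field names follow the Lottie JSON format:
# fr = frame rate, ip/op = in/out frame, ks = layer transform,
# a = animated flag, k = value or keyframe list, t = keyframe time, s = value.
TOY_LOTTIE = {
    "v": "5.7.0", "fr": 30, "ip": 0, "op": 60, "w": 512, "h": 512,
    "layers": [
        {
            "ty": 4,  # 4 = shape layer
            "nm": "ball",
            "ks": {
                "p": {"a": 1, "k": [
                    {"t": 0, "s": [100, 100]},
                    {"t": 60, "s": [400, 100]},
                ]},
                "o": {"a": 0, "k": 100},
            },
            "shapes": [
                {"ty": "el",  # ellipse primitive
                 "s": {"a": 0, "k": [50, 50]},
                 "p": {"a": 0, "k": [0, 0]}},
            ],
        }
    ],
}


def number_tokens(value):
    """Emit one (hypothetical) numeric token per scalar, rounded to an integer."""
    values = value if isinstance(value, list) else [value]
    return [f"<num:{round(v)}>" for v in values]


def property_tokens(name, prop):
    """Encode a static or keyframed Lottie property as a flat token list."""
    tokens = [f"<prop:{name}>"]
    if prop.get("a") == 1:
        # Animated property: one time token plus value tokens per keyframe.
        for keyframe in prop["k"]:
            tokens.append(f"<key:{round(keyframe['t'])}>")
            tokens.extend(number_tokens(keyframe["s"]))
    else:
        # Static property: value tokens only.
        tokens.extend(number_tokens(prop["k"]))
    return tokens


def tokenize(doc):
    """Flatten layers, transforms, and shape primitives into one sequence."""
    tokens = ["<anim>", f"<fps:{doc['fr']}>", f"<frames:{doc['op'] - doc['ip']}>"]
    for layer in doc["layers"]:
        tokens.append(f"<layer:{layer['ty']}>")
        for name, prop in layer.get("ks", {}).items():
            tokens.extend(property_tokens(f"ks.{name}", prop))
        for shape in layer.get("shapes", []):
            tokens.append(f"<shape:{shape['ty']}>")
            for name, prop in shape.items():
                # Skip non-property fields such as the "ty" type string.
                if isinstance(prop, dict) and "k" in prop:
                    tokens.extend(property_tokens(name, prop))
        tokens.append("</layer>")
    tokens.append("</anim>")
    return tokens


if __name__ == "__main__":
    print(" ".join(tokenize(TOY_LOTTIE)))
```

In this sketch, structural keys, quotes, and punctuation from the JSON are replaced by a small set of typed tokens, which is one plausible way a dedicated tokenizer could shorten sequences relative to tokenizing the raw JSON text; the actual design may differ substantially.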
