As a common form of communication on social media, stickers are widely loved in internet scenarios for their ability to convey emotions in a vivid, cute, and engaging way. People prefer to obtain an appropriate sticker through retrieval rather than creation, since creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Nowadays, advanced text-to-video algorithms have spawned numerous general video generation systems that allow users to customize high-quality, photo-realistic videos by providing only simple text prompts. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than videos, is greatly hindered by difficulties in data acquisition and incomplete benchmarks. To facilitate researchers' exploration in the animated sticker generation (ASG) field, we first construct VSD2M, currently the largest vision-language sticker dataset, at a two-million scale covering both static and animated stickers. Second, to improve the performance of traditional video generation methods on the ASG task, whose data exhibit discrete characteristics, we propose a Spatial Temporal Interaction (STI) layer that leverages semantic interaction and detail preservation to address the issue of insufficient information utilization. Moreover, we train baselines with several video generation methods (e.g., transformer-based and diffusion-based methods) on VSD2M and conduct a detailed analysis to establish systematic supervision for the ASG task. To the best of our knowledge, this is the most comprehensive large-scale benchmark for multi-frame animated sticker generation, and we hope this work provides valuable inspiration for other scholars in intelligent creation.