VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation

As a common form of communication on social media, stickers are popular with users for their ability to convey emotions in a vivid, cute, and interesting way. People prefer to obtain an appropriate sticker through retrieval rather than creation, because creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Nowadays, advanced text-to-video algorithms have spawned numerous general video generation systems that allow users to customize high-quality, photo-realistic videos from simple text prompts alone. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than videos, is greatly hindered by difficulties in data acquisition and incomplete benchmarks. To facilitate research in the animated sticker generation (ASG) field, we first construct VSD2M, currently the largest vision-language sticker dataset, at a two-million scale and containing both static and animated stickers. Second, to improve the performance of traditional video generation methods on ASG tasks with discrete characteristics, we propose a Spatial Temporal Interaction (STI) layer that utilizes semantic interaction and detail preservation to address the issue of insufficient information utilization. Moreover, we train baselines with several video generation methods (e.g., transformer-based and diffusion-based methods) on VSD2M and conduct a detailed analysis to establish systematic supervision on the ASG task. To the best of our knowledge, this is the most comprehensive large-scale benchmark for multi-frame animated sticker generation, and we hope this work can provide valuable inspiration for other scholars in intelligent creation.

