In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech video with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoint annotations, and 3D SMPL-X motion annotations covering over 10k identities, enabling multimodal learning and evaluation. To showcase the value of the dataset, we present Orator, an LLM-guided multimodal generation framework that serves as a simple baseline, in which the language model acts as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulation, and vocal modulation. These specifications drive an integrated multimodal video generation module that synthesizes coherent long-form videos. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
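To make the director's role concrete, the sketch below shows one plausible shape for the per-shot specifications such an LLM director might emit. This is a minimal illustration under our own assumptions: the `ShotPlan` and `DirectorScript` classes, their field names, and the shot-type vocabulary are hypothetical and are not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shot vocabulary mirroring the camera coverage the
# dataset describes: close-up, half-body, and full-body views.
SHOT_TYPES = ("close-up", "half-body", "full-body")

@dataclass
class ShotPlan:
    """One shot in a multi-shot plan emitted by the LLM director.

    All fields are illustrative assumptions, not the paper's schema.
    """
    start_sec: float   # shot start time within the speech
    end_sec: float     # shot end time
    shot_type: str     # one of SHOT_TYPES
    gesture_hint: str  # e.g. "open-palm emphasis on the key phrase"
    vocal_hint: str    # e.g. "slower pace, falling intonation"

@dataclass
class DirectorScript:
    """Full plan a downstream video generation module could consume."""
    transcript: str
    shots: List[ShotPlan] = field(default_factory=list)

    def validate(self) -> None:
        # Shots must tile the timeline without overlap so the
        # synthesized long-form video stays temporally coherent.
        for prev, cur in zip(self.shots, self.shots[1:]):
            assert prev.end_sec <= cur.start_sec, "overlapping shots"
        assert all(s.shot_type in SHOT_TYPES for s in self.shots)
```

A structured plan of this kind separates the director's decisions (when to cut, which framing, how the speaker should gesture and sound) from the rendering itself, which is one way to realize the camera/gesture/voice orchestration the abstract attributes to the language model.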