A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rates as low as $-71.3\%$ (H.266) relative to H.264, i.e., bitrate savings of up to 71.3% at equal VMAF, with significant encoder$\times$dataset ($\eta_p^2 = .112$) and encoder$\times$content-condition ($\eta_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the choice of dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.
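
To make the BD-rate figures above easier to interpret, the following is a minimal sketch of how a Bjøntegaard-style delta rate can be computed from rate-quality points. It is not the paper's evaluation code. As a simplifying assumption, it interpolates log-bitrate piecewise-linearly over quality rather than using the cubic fit found in common BD-rate implementations. The function names and the sample rate-quality points are hypothetical and are not taken from the dataset.

```python
import math


def _interp(x, xs, ys):
    """Linearly interpolate y at x; xs must be strictly increasing."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the sampled quality range")


def _mean_log_rate(quality, log_rate, lo, hi):
    """Average the piecewise-linear log-rate curve over [lo, hi]."""
    knots = [lo] + [q for q in quality if lo < q < hi] + [hi]
    values = [_interp(q, quality, log_rate) for q in knots]
    # The trapezoid rule is exact for a piecewise-linear curve.
    area = sum((b - a) * (va + vb) / 2.0
               for a, b, va, vb in zip(knots, knots[1:], values, values[1:]))
    return area / (hi - lo)


def bd_rate(anchor, test):
    """Approximate BD-rate in percent from (bitrate, quality) points.

    Negative values mean the test codec needs less bitrate than the
    anchor at equal quality. Quality values must be distinct per curve.
    """
    def prep(points):
        pts = sorted(points, key=lambda p: p[1])
        return [q for _, q in pts], [math.log(r) for r, _ in pts]

    qa, la = prep(anchor)
    qt, lt = prep(test)
    # Compare only over the quality interval covered by both curves.
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")
    diff = _mean_log_rate(qt, lt, lo, hi) - _mean_log_rate(qa, la, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical (bitrate in kbps, VMAF) points, for illustration only.
anchor_points = [(1000, 70.0), (2000, 80.0), (4000, 88.0), (8000, 94.0)]
test_points = [(rate / 2, vmaf) for rate, vmaf in anchor_points]
print(f"BD-rate: {bd_rate(anchor_points, test_points):.1f}%")
```

In this toy case the test codec reaches every quality level at half the anchor's bitrate, so the result is a 50% bitrate saving, reported as a negative BD-rate. A reported value of $-71.3\%$ reads the same way: on average over the shared quality range, the codec needs 71.3% less bitrate than the H.264 anchor.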

