Towards Understanding of Deepfake Videos in the Wild

Deepfakes have become a growing concern in recent years, prompting researchers to develop benchmark datasets and detection algorithms to tackle the issue. However, existing datasets have significant drawbacks that limit their effectiveness. Notably, they do not include the latest deepfake videos that are produced by state-of-the-art methods and shared across various platforms. This limitation makes it difficult to keep pace with the rapid evolution of the generative AI techniques used in real-world deepfake production. In this IRB-approved study, we bridge this knowledge gap by providing an in-depth analysis of current real-world deepfakes. First, we present RWDF-23, the largest, most diverse, and most recent deepfake dataset collected from the wild to date. It consists of 2,000 deepfake videos collected from 4 platforms (Reddit, YouTube, TikTok, and Bilibili), targeting 4 different languages and originating from 21 countries. By expanding the dataset's scope beyond previous research, we capture a broader range of real-world deepfake content, reflecting the ever-evolving landscape of online platforms. Second, we conduct a comprehensive analysis of various aspects of deepfakes, including creators, manipulation strategies, purposes, and real-world content production methods. This allows us to gain valuable insights into the nuances and characteristics of deepfakes in different contexts. Lastly, in addition to the video content, we collect viewer comments and interactions, enabling us to explore how internet users engage with deepfake content. By considering this rich contextual information, we aim to provide a holistic understanding of the evolving deepfake phenomenon and its impact on online platforms.
