TikTok is one of the largest and fastest-growing social media platforms in the world. However, some TikTok features, such as voice transcripts, are often missing, and other important features, such as optical character recognition (OCR) or video descriptions, do not exist. We introduce the Generative AI Enriched TikTok (GET-Tok) data, a pipeline for collecting TikTok videos and enriching their data by augmenting the TikTok Research API with generative AI models. As a case study, we collect videos about the attempted coup in Peru initiated by its former President, Pedro Castillo, and the accompanying protests. The data includes information on 43,697 videos published from November 20, 2022 to March 1, 2023 (102 days). Generative AI augments the collected data with transcripts of the TikTok videos, text descriptions of what is shown in the videos, the text displayed within the videos, and the stances expressed in the videos. Overall, this pipeline will contribute to a better understanding of online discussion in a multimodal setting through applications of generative AI, and it highlights the utility of this pipeline for non-English-language social media. The code used to produce the pipeline is available in a public GitHub repository: https://github.com/gabbypinto/GET-Tok-Peru.
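To make the shape of the enriched data concrete, the following is a minimal sketch of how one video record and its four generative AI enrichments might be represented, together with a check of the 102-day collection window. It is not taken from the GET-Tok repository: the names `EnrichedVideo`, `enrich`, `window_days`, and the field names are hypothetical, and each enricher is assumed to be a plain callable standing in for a real model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

# Collection window stated in the abstract (both endpoints included).
WINDOW_START = date(2022, 11, 20)
WINDOW_END = date(2023, 3, 1)

# Hypothetical names for the four enrichments the abstract lists.
ENRICHMENT_FIELDS = ("transcript", "visual_description", "on_screen_text", "stance")


def window_days(start: date, end: date) -> int:
    """Number of calendar days in the window, counting both endpoints."""
    return (end - start).days + 1


@dataclass
class EnrichedVideo:
    """Hypothetical record: one video plus its generative AI enrichments."""
    video_id: str
    transcript: Optional[str] = None          # speech in the video
    visual_description: Optional[str] = None  # what is shown in the video
    on_screen_text: Optional[str] = None      # text displayed within the video
    stance: Optional[str] = None              # stance expressed in the video


# Assumption: an enricher maps a local video path to a text output.
Enricher = Callable[[str], str]


def enrich(video_id: str, path: str, enrichers: dict[str, Enricher]) -> EnrichedVideo:
    """Run the supplied enrichers; fields without an enricher stay None."""
    record = EnrichedVideo(video_id=video_id)
    for field_name, fn in enrichers.items():
        if field_name not in ENRICHMENT_FIELDS:
            raise KeyError(f"unknown enrichment field: {field_name}")
        setattr(record, field_name, fn(path))
    return record
```

Keeping every enrichment optional reflects that a given video may lack speech or on-screen text, so a record can be partially filled without failing the whole pipeline.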