This paper introduces two large-scale multilingual comment datasets from YouTube: YT-30M and YT-100K. YT-100K is a randomly selected 100K-comment sample of YT-30M, and the analysis in this paper is performed on this smaller sample. Both datasets are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted on YouTube channels that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
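To make the per-comment fields concrete, the following is a minimal sketch of how one record might be represented and parsed. The released file format is not stated here, so the CSV layout, the column names, the `Comment` class, the `read_comments` helper, and the sample row are all hypothetical and chosen only to mirror the eight fields listed above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    """One comment record; field names are assumed, not taken from the release."""
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    comment_text: str
    upvotes: int
    original_channel_id: str
    category: str


def read_comments(handle):
    """Yield Comment records from a CSV file object that has a header row."""
    for row in csv.DictReader(handle):
        yield Comment(
            video_id=row["video_id"],
            comment_id=row["comment_id"],
            commenter_name=row["commenter_name"],
            commenter_channel_id=row["commenter_channel_id"],
            comment_text=row["comment_text"],
            upvotes=int(row["upvotes"]),  # stored as text in CSV, so convert
            original_channel_id=row["original_channel_id"],
            category=row["category"],
        )


# Made-up sample row, for illustration only (not real data from YT-30M).
SAMPLE_CSV = (
    "video_id,comment_id,commenter_name,commenter_channel_id,"
    "comment_text,upvotes,original_channel_id,category\n"
    "vid001,cmt001,example_user,chan_user_01,"
    "Great explanation,3,chan_orig_01,News & Politics\n"
)

comments = list(read_comments(io.StringIO(SAMPLE_CSV)))
print(comments[0].category, comments[0].upvotes)
```

Typing `upvotes` as an integer while keeping the identifiers as strings makes it straightforward to group comments by `category` or `original_channel_id` and to aggregate upvotes, provided the actual release uses comparable columns.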