Telegram is one of the most popular instant messaging apps today. Beyond private messaging, its channels offer an effective medium for rapidly broadcasting content to a large audience (e.g., COVID-19 announcements) but, unfortunately, also for disseminating radical ideologies and coordinating attacks (e.g., the Capitol Hill riot). This paper presents TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages, making it, to the best of our knowledge, the largest collection of Telegram channels. After a brief introduction to the data collection process, we analyze the languages used in the dataset and the topics covered by its English-language channels. Finally, we discuss use cases in which the dataset can help to better understand the Telegram ecosystem and to study the diffusion of questionable news. In addition to the raw dataset, we released the scripts we used to analyze it and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.