Social media, as a means of computer-mediated communication, has been used extensively to study the sentiment that users express around events or topics. There is, however, a gap in the longitudinal study of how sentiment in social media has evolved over the years. To fill this gap, we develop TM-Senti, a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets covering a period of over seven years. We describe and assess our methodology for building a large-scale sentiment analysis dataset labelled using emoticons and emojis, and we analyse the resulting dataset. Our analysis highlights interesting temporal changes, including the increasing use of emojis over emoticons. We publicly release the dataset for further research in tasks including sentiment analysis and text classification of tweets. Because the dataset is based on the archive of tweets publicly available on the Internet Archive, it can be fully rehydrated, including tweet metadata, without missing tweets.