Social media, as a means of computer-mediated communication, has been used extensively to study the sentiment that users express around events or topics. There is, however, a gap in the longitudinal study of how sentiment in social media has evolved over the years. To fill this gap, we develop TM-Senti, a new large-scale, distantly supervised Twitter sentiment dataset containing over 184 million tweets and spanning more than seven years. We describe and assess our methodology for building a large-scale sentiment analysis dataset labelled by means of emoticons and emojis, and we present an analysis of the resulting dataset. Our analysis highlights notable temporal changes, including the increasing use of emojis over emoticons. We publicly release the dataset for further research on tasks including sentiment analysis and text classification of tweets. Because the dataset is based on the archive of tweets publicly available on the Internet Archive, it can be fully rehydrated, including tweet metadata, with no missing tweets.