The emojification of sentiment on social media: Collection and analysis of a longitudinal Twitter sentiment dataset

Social media, as a means of computer-mediated communication, has been used extensively to study the sentiment users express around events or topics. There is, however, a gap in the longitudinal study of how sentiment on social media has evolved over the years. To fill this gap, we develop TM-Senti, a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets covering a period of more than seven years. We describe and assess our methodology for assembling a large-scale sentiment analysis dataset labelled by emoticons and emojis, and we analyse the resulting dataset. Our analysis highlights interesting temporal changes, including the increasing use of emojis over emoticons. We publicly release the dataset for further research in tasks including sentiment analysis and text classification of tweets. Because the dataset is based on the archive of tweets publicly available on the Internet Archive, it can be fully rehydrated, including tweet metadata, with no missing tweets.
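
To make the idea of emoticon- and emoji-based distant supervision concrete, the following is a minimal sketch and not the procedure used to build TM-Senti. The marker sets, the function name `distant_label`, the rule of discarding tweets with conflicting markers, and the step of stripping markers from the text are all assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical marker sets for illustration only; these are NOT the
# emoticon/emoji lists used to build TM-Senti.
POSITIVE_MARKERS = {":)", ":-)", ":D", "😀", "😊"}
NEGATIVE_MARKERS = {":(", ":-(", "😢", "😠"}

# Remove longer markers first so a short marker never clips a longer one.
ALL_MARKERS = sorted(POSITIVE_MARKERS | NEGATIVE_MARKERS, key=len, reverse=True)


def distant_label(text):
    """Return (label, cleaned_text), or None if the tweet cannot be labelled.

    A tweet is labelled only when it contains markers of exactly one
    polarity. Tweets with no marker, or with markers of both polarities,
    are skipped (an assumed filtering rule).
    """
    has_pos = any(marker in text for marker in POSITIVE_MARKERS)
    has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
    if has_pos == has_neg:  # neither polarity, or both
        return None

    label = "positive" if has_pos else "negative"

    # Strip the markers so a classifier trained on the text cannot
    # simply memorise the label source (assumed preprocessing step).
    cleaned = text
    for marker in ALL_MARKERS:
        cleaned = cleaned.replace(marker, "")
    return label, " ".join(cleaned.split())
```

Under these assumptions, a tweet such as "great day :)" would be kept as a positive example with the emoticon removed, while a tweet containing both ":)" and ":(" would be dropped as ambiguous.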

