This paper introduces a large collection of time series data derived from Twitter and postprocessed using word embedding techniques and specialized fine-tuned language models. The data span the past five years and capture changes in n-gram frequency, similarity, sentiment, and topic distribution. An interface built on top of this data enables temporal analysis for detecting and characterizing shifts in meaning, and it complements trending metrics with additional information such as sentiment and topic association over time. We release an online demo for easy experimentation, and we share the code and the underlying aggregated data for future work. We also discuss three case studies enabled by our platform, showcasing its potential for temporal linguistic analysis.