Twitter is currently one of the biggest social media platforms. Its users can share, read, and engage with short posts called tweets. For the annual RecSys Challenge at the ACM Recommender Systems Conference 2020, Twitter published a dataset of roughly 70 GB. The 2020 Challenge invited participating teams to build models that predict engagement likelihoods for given user-tweet pairs. The submitted models, which predicted like, reply, retweet, and quote engagements, were evaluated on two metrics: area under the precision-recall curve (PRAUC) and relative cross-entropy (RCE). In this diploma thesis, we used the RecSys 2020 Challenge dataset and evaluation procedure to investigate how well context alone can predict tweet engagement likelihood. To do so, we used the Spark engine on TU Wien's Little Big Data Cluster to build scalable pipelines for data preprocessing, feature engineering, feature selection, and machine learning. We manually created just under 200 additional features to describe tweet context. The results indicate that the most informative features were those describing users' prior engagement history and the popularity of the hashtags and links in a tweet. We also found that the prediction algorithm, the training dataset size, the training dataset sampling method, and feature selection all significantly affect the results. Comparing the best results of our context-only models with content-only models and with the models developed by the Challenge winners, we found that the context-based models underperformed in terms of the RCE score. This work concludes by putting this discrepancy in context and proposing potential improvements to our implementation, which is shared in a public git repository.
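To make the two evaluation metrics concrete, the following is a minimal sketch in plain Python and is not the thesis's Spark implementation. It assumes RCE is defined as the percentage improvement in cross-entropy over a naive predictor that always outputs the positive rate of the evaluated labels, and it approximates PRAUC with the trapezoidal rule over the precision-recall points; the function names `relative_cross_entropy` and `pr_auc` are hypothetical.

```python
import math


def _cross_entropy(labels, preds, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def relative_cross_entropy(labels, preds):
    """RCE in percent (assumed definition): improvement over a constant
    predictor that always outputs the positive rate of `labels`."""
    positive_rate = sum(labels) / len(labels)
    if positive_rate in (0.0, 1.0):
        raise ValueError("labels must contain both classes")
    naive = _cross_entropy(labels, [positive_rate] * len(labels))
    return (1.0 - _cross_entropy(labels, preds) / naive) * 100.0


def pr_auc(labels, scores):
    """Area under the precision-recall curve via the trapezoidal rule."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("labels must contain at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 1.0)]  # conventional start: recall 0, precision 1
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before adding a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((tp / positives, tp / (tp + fp)))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


if __name__ == "__main__":
    y_true = [1, 0, 1, 0]
    y_pred = [0.9, 0.8, 0.7, 0.1]
    print("PRAUC:", pr_auc(y_true, y_pred))
    print("RCE:  ", relative_cross_entropy(y_true, y_pred))
```

Under this definition, an RCE of 0 means the model is no better than always predicting the average engagement rate, and a negative RCE means it is worse. This is why a model can rank tweets reasonably well (decent PRAUC) and still score poorly on RCE if its predicted probabilities are badly calibrated.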