We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset comprises a varied collection of posts from five distinct Romanian subreddits, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. The top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further experiments based on few-shot prompting of the Falcon-7B Large Language Model point in the same direction. We thus believe that PoPreRo is a valuable resource for evaluating models that predict the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.
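For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and macro F1 are computed from reference and predicted labels. The function names and the toy label lists are illustrative assumptions; they are not taken from the PoPreRo code or data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (not PoPreRo data): a classifier that always
# predicts class 0 on an imbalanced sample.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 0, 0]

print(f"accuracy = {accuracy(toy_true, toy_pred):.4f}")
print(f"macro F1 = {macro_f1(toy_true, toy_pred):.4f}")
```

Because macro F1 averages the per-class scores with equal weight, a model that ignores one class is penalized even when its accuracy looks acceptable, which is why the two numbers are worth reporting together.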