Predicting the virality of online content remains challenging, especially for culturally complex, fast-evolving memes. This study investigates the feasibility of early meme virality prediction using a large-scale, cross-lingual dataset from 25 diverse Reddit communities. We propose a robust, data-driven method that defines virality through a hybrid engagement score, learning a percentile-based threshold from the training portion of a chronological split to prevent data leakage. We evaluate a suite of models, including Logistic Regression, XGBoost, and a Multi-layer Perceptron (MLP), with a comprehensive, multimodal feature set across increasing time windows (30-420 min). Crucially, useful signals emerge quickly: our best-performing model, XGBoost, achieves a PR-AUC $> 0.52$ at the shortest, 30-minute window. Our analysis reveals a clear "evidentiary transition," in which feature importance shifts from static context to temporal dynamics as a meme gains traction. This work establishes a robust, interpretable, and practical benchmark for early virality prediction in scenarios where full diffusion cascade data is unavailable, and contributes a novel cross-lingual dataset and a methodologically sound definition of virality. To our knowledge, this study is the first to combine time series data with static content and network features for early prediction of meme virality.
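The leakage-free labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `label_viral`, the 80/20 chronological split, and the 95th-percentile cutoff are all assumptions chosen for illustration, and the hybrid engagement score is taken as a precomputed array because the abstract does not give its formula.

```python
import numpy as np

def label_viral(timestamps, scores, train_frac=0.8, percentile=95.0):
    """Label posts as viral using a threshold learned from the training split only.

    Hypothetical helper: train_frac and percentile are illustrative defaults,
    not values reported in the source.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Chronological split: the earliest posts form the training set.
    order = np.argsort(timestamps, kind="stable")
    n_train = int(len(order) * train_frac)
    train_idx, test_idx = order[:n_train], order[n_train:]

    # The threshold sees training scores only, so later posts cannot leak in.
    threshold = np.percentile(scores[train_idx], percentile)

    # The same fixed threshold labels every post, including the later ones.
    labels = scores >= threshold
    return threshold, labels, train_idx, test_idx
```

Because the threshold is fixed before any later post is seen, the positive rate in the test period may differ from the nominal percentile, which is the expected behavior when the definition of virality is not allowed to adapt to held-out data.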