Social media popularity (SMP) prediction is a complex task that requires integrating multi-modal data. While pre-trained vision-language models (VLMs) such as CLIP have been widely adopted for this task, their effectiveness in capturing the unique characteristics of social media content remains largely unexplored. This paper critically examines the applicability of CLIP-based features to SMP prediction, focusing on the overlooked phenomenon of semantic inconsistency between the images and text of social media posts. Through extensive analysis, we demonstrate that this inconsistency increases with post popularity, challenging the conventional use of VLM features. We provide a comprehensive investigation of semantic inconsistency across different popularity intervals and analyze the impact of VLM feature adaptation on SMP tasks. Our experiments reveal that incorporating inconsistency measures and adapted text features significantly improves model performance, achieving a Spearman rank correlation (SRC) of 0.729 and a mean absolute error (MAE) of 1.227. These findings not only enhance SMP prediction accuracy but also provide crucial insights for developing more targeted approaches to social media analysis.
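To make the notion of image-text semantic inconsistency concrete, the sketch below shows one plausible way to score it with off-the-shelf CLIP embeddings. The checkpoint name, the use of cosine distance, and the helper function are illustrative assumptions for this sketch only; the abstract does not specify the exact measure used in the paper.

```python
# Hedged sketch: scoring image-text semantic inconsistency of a post
# as 1 - cosine similarity between CLIP image and text embeddings.
# Model choice and distance function are assumptions, not the paper's
# confirmed procedure.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def image_text_inconsistency(image: Image.Image, caption: str) -> float:
    """Return 1 - cosine similarity of CLIP image/text embeddings.

    Higher values indicate a larger semantic gap between a post's
    image and its accompanying text.
    """
    inputs = processor(
        text=[caption], images=image,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
    # L2-normalize before taking the dot product (cosine similarity).
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum(dim=-1).item()
    return 1.0 - cosine

# Usage (hypothetical file path and caption):
# score = image_text_inconsistency(Image.open("post.jpg"), "sunset over the bay")
```

Such a per-post score could then be binned by popularity interval to inspect how inconsistency varies with popularity, or appended as an extra input feature to a popularity regressor, in line with the analysis described above.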