Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, an MVPP approach must understand both the temporal dynamics of a given video (the temporal dimension) and its historical relevance to other videos (the spatial dimension). However, existing approaches suffer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can expand without bound to incorporate all relevant historical videos. Technically, Temporal Enlargement is driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.
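The idea of updating cluster features instead of expanding memory capacity can be illustrated with a minimal sketch. This is not the Topology-Aware Memory Bank described above: it assumes a flat (non-hierarchical) set of clusters, Euclidean nearest-cluster assignment, and a running-mean update standing in for the encoder feature update. The class name `ClusterMemory` and its methods are hypothetical and chosen only for illustration.

```python
import math


class ClusterMemory:
    """Illustrative fixed-capacity memory (hypothetical, not the paper's design).

    Each slot stores the running mean of the features assigned to it, so
    adding a video changes a slot's content but never adds a slot.
    """

    def __init__(self, centroids):
        # Copy the initial cluster features; each starts with a count of 1.
        self.centroids = [[float(v) for v in c] for c in centroids]
        self.counts = [1] * len(self.centroids)

    def nearest(self, feature):
        # Assumption: Euclidean distance decides which cluster a video joins.
        return min(
            range(len(self.centroids)),
            key=lambda k: math.dist(self.centroids[k], feature),
        )

    def add(self, feature):
        # Fold the new feature into its nearest cluster via an incremental mean.
        k = self.nearest(feature)
        self.counts[k] += 1
        n = self.counts[k]
        self.centroids[k] = [
            c + (x - c) / n for c, x in zip(self.centroids[k], feature)
        ]
        return k


memory = ClusterMemory([[0.0, 0.0], [10.0, 10.0]])
slot = memory.add([2.0, 0.0])  # joins the first cluster and shifts its mean
```

In this sketch the number of stored vectors stays constant however many videos are added, which is the property the abstract attributes to updating cluster features rather than growing the memory.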
