To address the challenges of high computational cost and long-range dependencies in existing video understanding methods such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose the LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features, enhancing focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as the forgetting gate of the LSTM, guiding long-term memory management. A tube masking strategy reduces redundant information in video and mitigates overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
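The recurrent mechanism described above can be illustrated with a minimal NumPy sketch of one LCR step. The exact gate wiring and all weight names here are assumptions, not the paper's implementation; the abstract only states that (a) a Cross RWKV gate lets current-frame edge features interact with past features, and (b) edge information drives the LSTM forget gate that manages long-term memory.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

# Hypothetical projection matrices (names are ours, not from the paper):
# receptance, key, value (RWKV-style gating) and an edge-driven forget gate.
W_r, W_k, W_v, W_f = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

def lcr_step(frame, edge, h_prev, c_prev):
    """One assumed LCR recurrent step.

    Cross RWKV gate: receptance from current-frame edge features gates a
    mix of the past hidden state (key) and the current frame (value).
    Edge features also drive the LSTM forget gate on long-term memory c.
    """
    r = sigmoid(edge @ W_r)             # receptance from edge cues
    k = h_prev @ W_k                    # key from past features
    v = frame @ W_v                     # value from the current frame
    cross = r * np.tanh(k + v)          # cross-gated frame/past interaction
    f = sigmoid(edge @ W_f)             # edge-driven forget gate
    c = f * c_prev + (1.0 - f) * cross  # update long-term memory
    h = np.tanh(c)                      # new hidden state
    return h, c

# Process a short clip frame by frame, carrying (h, c) through time.
T = 4
h, c = np.zeros(D), np.zeros(D)
for t in range(T):
    frame = rng.standard_normal(D)
    edge = rng.standard_normal(D)  # e.g. from an edge detector on frame t
    h, c = lcr_step(frame, edge, h, c)
```

Because each step costs a fixed number of matrix-vector products, the per-clip cost grows linearly with the number of frames, matching the linear-complexity claim.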
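The tube masking strategy can likewise be sketched. Assuming it follows the usual VideoMAE-style convention (the abstract does not spell out the details), a single spatial mask is sampled once and repeated across all frames, so a masked patch stays masked through time and cannot be trivially recovered from adjacent frames:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 4, 16        # frames and patch tokens per frame (toy sizes)
mask_ratio = 0.75   # fraction of tokens to mask (assumed hyperparameter)

# Sample one spatial mask and extend it along time as a "tube".
n_keep = int(N * (1 - mask_ratio))
keep = rng.permutation(N)[:n_keep]         # spatial positions kept visible
mask = np.ones(N, dtype=bool)
mask[keep] = False                         # True = masked token
tube_mask = np.broadcast_to(mask, (T, N))  # identical mask for every frame
```

Repeating the mask along time removes temporally redundant tokens, which is how the strategy cuts redundant information and discourages overfitting.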