To address the challenges of high computational cost and long-range dependencies in existing video understanding methods such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose the LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current-frame edge information and past features, enhancing focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as the forgetting gate of the LSTM, guiding long-term memory management. A tube masking strategy reduces redundant information in video and mitigates overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
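The recurrent mechanism described above can be illustrated with a minimal NumPy sketch of one LCR step. The exact gate wiring and all weight names here are assumptions, not the paper's implementation; the abstract only states that (a) a Cross RWKV gate lets current-frame edge features interact with past features, and (b) edge information drives the LSTM forget gate that manages long-term memory.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

# Hypothetical projection matrices (names are ours, not from the paper):
# receptance, key, value (RWKV-style gating) and an edge-driven forget gate.
W_r, W_k, W_v, W_f = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

def lcr_step(frame, edge, h_prev, c_prev):
    """One assumed LCR recurrent step.

    Cross RWKV gate: receptance from current-frame edge features gates a
    mix of the past hidden state (key) and the current frame (value).
    Edge features also drive the LSTM forget gate on long-term memory c.
    """
    r = sigmoid(edge @ W_r)             # receptance from edge cues
    k = h_prev @ W_k                    # key from past features
    v = frame @ W_v                     # value from the current frame
    cross = r * np.tanh(k + v)          # cross-gated frame/past interaction
    f = sigmoid(edge @ W_f)             # edge-driven forget gate
    c = f * c_prev + (1.0 - f) * cross  # update long-term memory
    h = np.tanh(c)                      # new hidden state
    return h, c

# Process a short clip frame by frame, carrying (h, c) through time.
T = 4
h, c = np.zeros(D), np.zeros(D)
for t in range(T):
    frame = rng.standard_normal(D)
    edge = rng.standard_normal(D)  # e.g. from an edge detector on frame t
    h, c = lcr_step(frame, edge, h, c)
```

Because each step costs a fixed number of matrix-vector products, the per-clip cost grows linearly with the number of frames, matching the linear-complexity claim.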
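The tube masking strategy can likewise be sketched. Assuming it follows the usual VideoMAE-style convention (the abstract does not spell out the details), a single spatial mask is sampled once and repeated across all frames, so a masked patch stays masked through time and cannot be trivially recovered from adjacent frames:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 4, 16        # frames and patch tokens per frame (toy sizes)
mask_ratio = 0.75   # fraction of tokens to mask (assumed hyperparameter)

# Sample one spatial mask and extend it along time as a "tube".
n_keep = int(N * (1 - mask_ratio))
keep = rng.permutation(N)[:n_keep]         # spatial positions kept visible
mask = np.ones(N, dtype=bool)
mask[keep] = False                         # True = masked token
tube_mask = np.broadcast_to(mask, (T, N))  # identical mask for every frame
```

Repeating the mask along time removes temporally redundant tokens, which is how the strategy cuts redundant information and discourages overfitting.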