We propose a novel block for \emph{causal} video modelling. It relies on a time-space-channel factorisation with a dedicated block for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs mix over channels. The resulting architecture, \emph{TRecViT}, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, making it the first causal video model in the state-space model family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOP count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real time. Compared with causal transformer-based models (TSM, RViT) and other recurrent models such as LSTM, TRecViT obtains state-of-the-art results on the challenging SSv2 dataset. Code and checkpoints are available at https://github.com/google-deepmind/trecvit.
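The time-space-channel factorisation described above can be sketched in a few lines. The following is a minimal, hedged NumPy illustration, not the paper's implementation: the exact gating of the LRU, the normalisation layers, multi-head attention, and all hyperparameters are simplified, and every weight name (`w_a`, `w_i`, etc.) is a placeholder introduced here for illustration. It only shows the structure: a causal recurrence over time per token, full self-attention over space per frame, and an MLP over channels, each with a residual connection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_lru(x, w_a, w_i):
    # x: (T, N, D) = (frames, tokens per frame, channels).
    # Causal linear recurrence over time, applied independently per token.
    # Simplified gating (placeholder): the real LRU gating differs.
    T, N, D = x.shape
    h = np.zeros((N, D))
    out = np.empty_like(x)
    for t in range(T):
        a = sigmoid(x[t] @ w_a)                        # recurrence gate in (0, 1)
        i = sigmoid(x[t] @ w_i)                        # input gate
        h = a * h + np.sqrt(1.0 - a**2) * (i * x[t])   # gated state update
        out[t] = h
    return out

def self_attention(x, w_q, w_k, w_v):
    # x: (N, D) tokens of one frame; non-causal spatial mixing (single head).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def mlp(x, w1, w2):
    # Per-token channel mixing with a ReLU nonlinearity.
    return np.maximum(x @ w1, 0.0) @ w2

def trecvit_block(x, p):
    # Time -> space -> channel, each sub-block with a residual connection.
    x = x + gated_lru(x, p["w_a"], p["w_i"])
    x = x + np.stack([self_attention(f, p["w_q"], p["w_k"], p["w_v"]) for f in x])
    x = x + mlp(x, p["w1"], p["w2"])
    return x
```

Because only the LRU mixes over time and it runs strictly forward, the block is causal: the output at frame $t$ depends only on frames $\le t$, which is what enables streaming inference.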