Transformers have achieved state-of-the-art performance in solving the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus suffer limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture that avoids temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple densely connected groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN), each of which operates on a separate channel portion at a different scale, enabling multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of attending more within frames than between frames while reducing computational overhead. GSM-FFN is designed to enhance locality via a gating mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by $>\!0.5$ dB with comparable or lower complexity and fewer parameters. Source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
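To illustrate the "attend within frames rather than between frames" idea behind separable attention, below is a minimal NumPy sketch (not the paper's actual CSS-MSA: learned query/key/value projections, multi-head splitting, the temporal branch, and the cross-scale channel grouping are all omitted; identity projections are assumed for brevity). Restricting attention to each frame reduces the score matrix from $(TN)\times(TN)$ to $T$ matrices of size $N\times N$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    """Frame-local self-attention sketch.

    x: (T, N, C) video features with T frames, N spatial tokens, C channels.
    Identity Q/K/V projections are assumed for illustration only.
    Scores are (T, N, N), i.e. computed within each frame, costing
    O(T * N^2 * C) instead of O((T*N)^2 * C) for full spatio-temporal
    attention over all T*N tokens.
    """
    T, N, C = x.shape
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)  # (T, N, N)
    return softmax(scores, axis=-1) @ x             # (T, N, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32))  # 8 frames, 16 tokens, 32 channels
y = spatial_attention(x)
print(y.shape)
```

Each frame's attention weights are row-stochastic, so the output stays in the convex hull of that frame's value tokens; no information crosses frames, which is the inductive bias the abstract describes.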