We consider video snapshot compressive imaging (SCI), in which a sequence of high-speed frames is modulated by different masks and captured in a single measurement. Reconstructing multiple frames from only one measurement is an ill-posed inverse problem. By combining optimization algorithms with neural networks, deep unfolding networks (DUNs) have achieved remarkable success in solving inverse problems. Building on the DUN framework, we propose a 3D Convolution-Transformer Mixture (CTM) module with an efficient and scalable 3D attention model plugged in, which uses the Transformer to fully learn the correlations across the temporal and spatial dimensions. To the best of our knowledge, this is the first time a Transformer has been applied to video SCI reconstruction. In addition, to further investigate the high-frequency information in the reconstruction process, which previous studies have neglected, we introduce variance estimation that characterizes uncertainty on a pixel-by-pixel basis. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) results, with a 1.2 dB gain in PSNR over the previous SOTA algorithm. We will release the code.
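To make the measurement process concrete, the following is a minimal NumPy sketch of the standard video SCI forward model, in which each frame is multiplied element-wise by its mask and the modulated frames are summed into one 2D measurement. It is an illustration only, not the code of the proposed method: the function names `sci_forward` and `sci_adjoint`, the random binary masks, and the sizes `T`, `H`, `W` are all assumptions chosen for this example.

```python
import numpy as np


def sci_forward(frames, masks, noise_std=0.0, rng=None):
    """Simulate one SCI measurement (hypothetical helper, for illustration).

    frames, masks: arrays of shape (T, H, W).
    Returns an (H, W) array: the sum over t of masks[t] * frames[t],
    plus optional Gaussian noise.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must have the same shape (T, H, W)")
    # Element-wise modulation followed by summation along the time axis.
    measurement = (masks * frames).sum(axis=0)
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        measurement = measurement + rng.normal(0.0, noise_std, measurement.shape)
    return measurement


def sci_adjoint(measurement, masks):
    """Back-project an (H, W) measurement to (T, H, W) using the masks."""
    masks = np.asarray(masks, dtype=np.float64)
    return masks * np.asarray(measurement, dtype=np.float64)[None, :, :]


# Assumed toy sizes and random binary masks; real systems use calibrated masks.
rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = rng.random((T, H, W))
masks = (rng.random((T, H, W)) > 0.5).astype(np.float64)

y = sci_forward(frames, masks)
x_back = sci_adjoint(y, masks)

# T*H*W unknowns must be recovered from only H*W measured values.
print("unknowns:", frames.size, "measurements:", y.size)
```

With these toy sizes there are eight times as many unknown pixel values as measured ones, which is why the reconstruction is ill-posed and needs a learned or hand-crafted prior. Unfolding-style methods typically alternate between operators like these two and a learned denoising step.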