The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality, whose inner workings and learned representations are difficult to interpret visually. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos and the cross-frame temporal coherence of their motion. We evaluate the applicability of LEAPS quantitatively and qualitatively by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, an inversion that, to the best of our knowledge, has not been accomplished before.
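
The optimization loop described in the abstract (a fixed model, a target class, and a video that starts as random noise and is updated iteratively) can be illustrated with a minimal sketch. This is not the LEAPS implementation. The frozen network is replaced by a hypothetical toy linear classifier so that the input gradient can be written by hand, the stimulus-video priming and the feature-diversity regularizer are omitted, and temporal coherence is approximated by a simple squared-difference penalty between consecutive frames. The names `class_probs`, `synthesize_video`, `lam`, `lr`, and `steps` are assumptions made for this example only.

```python
import math
import random


def class_probs(weights, video):
    """Softmax class probabilities of a toy linear 'model' (assumed stand-in)."""
    logits = [
        sum(w * x for w_t, v_t in zip(w_c, video) for w, x in zip(w_t, v_t))
        for w_c in weights
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_video(weights, target, n_frames, n_pixels,
                     steps=300, lr=0.05, lam=0.05, seed=0):
    """Gradient descent on the input video; weights[c][t][p] stay fixed."""
    rng = random.Random(seed)
    # Initialize the video with low-amplitude random noise: video[t][p].
    video = [[rng.gauss(0.0, 0.01) for _ in range(n_pixels)]
             for _ in range(n_frames)]
    for _ in range(steps):
        probs = class_probs(weights, video)
        # d(cross-entropy)/d(logit_c) = p_c - 1[c == target]
        dlogits = [p - (1.0 if c == target else 0.0)
                   for c, p in enumerate(probs)]
        # Chain rule through the linear model gives the input gradient.
        grad = [[sum(d * weights[c][t][p] for c, d in enumerate(dlogits))
                 for p in range(n_pixels)] for t in range(n_frames)]
        # Temporal penalty (assumed form): lam * sum of squared
        # differences between consecutive frames.
        for t in range(n_frames - 1):
            for p in range(n_pixels):
                diff = video[t + 1][p] - video[t][p]
                grad[t + 1][p] += 2.0 * lam * diff
                grad[t][p] -= 2.0 * lam * diff
        # Update only the video; the model is never modified.
        for t in range(n_frames):
            for p in range(n_pixels):
                video[t][p] -= lr * grad[t][p]
    return video
```

With a real spatiotemporal network, the hand-written gradient would be replaced by automatic differentiation with respect to the input, but the structure stays the same: the model parameters are frozen and only the video tensor is optimized.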