Extracting informative representations from videos is fundamental to learning downstream tasks effectively. We present a novel approach for unsupervised learning of meaningful representations from videos that leverages image spatial entropy (ISE), a measure that quantifies the per-pixel information in an image. We argue that the \textit{local entropy} of pixel neighborhoods and its temporal evolution provide valuable intrinsic supervisory signals for learning prominent features. Building on this idea, we abstract visual features into a concise representation of keypoints that act as dynamic information transmitters, and we design a deep learning model that learns, in a purely unsupervised manner, spatially and temporally consistent representations \textit{directly} from video frames. Two original information-theoretic losses, computed from local entropy, guide our model to discover consistent keypoint representations: a loss that maximizes the spatial information covered by the keypoints and a loss that optimizes the keypoints' information transportation over time. We compare our keypoint representation to strong baselines on various downstream tasks, e.g., learning object dynamics. Our empirical results show superior performance for our information-driven keypoints, which resolve challenges such as attending to both static and dynamic objects and handling objects that abruptly enter or leave the scene.
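To make the notion of local entropy concrete, the following is a minimal sketch of one common way to compute a per-pixel entropy map: the Shannon entropy of the quantized intensity histogram inside a square window around each pixel. This is an illustrative assumption, not necessarily the ISE estimator used in this work; the function name `local_entropy` and the parameters `window` and `bins` are hypothetical.

```python
import numpy as np


def local_entropy(image, window=3, bins=8):
    """Per-pixel Shannon entropy (in bits) of quantized intensities in a square window.

    Illustrative sketch only: a plain histogram estimate over each pixel's
    neighborhood, assumed here for exposition.
    """
    if window % 2 != 1:
        raise ValueError("window must be odd so that it can be centred on a pixel")
    img = np.asarray(image, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi > lo:
        # Quantize intensities into `bins` levels, 0 .. bins - 1.
        q = np.floor((img - lo) / (hi - lo) * bins).astype(np.int64)
        q = np.clip(q, 0, bins - 1)
    else:
        # A constant image has a single intensity level.
        q = np.zeros(img.shape, dtype=np.int64)

    pad = window // 2
    padded = np.pad(q, pad, mode="edge")
    # windows[i, j] is the (window x window) patch centred on pixel (i, j).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (window, window))

    entropy = np.zeros(img.shape, dtype=np.float64)
    n = window * window
    for b in range(bins):
        # Empirical probability of level b inside each pixel's window.
        p = (windows == b).sum(axis=(-2, -1)) / n
        nz = p > 0
        entropy[nz] -= p[nz] * np.log2(p[nz])
    return entropy
```

Under this sketch, flat regions map to zero entropy while edges and textured regions map to higher values. Applying the function to consecutive grayscale frames and comparing the resulting maps gives a rough view of how local information evolves over time, which is the kind of signal the two losses described above are said to exploit.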