All organisms make temporal predictions, and their evolutionary fitness depends on the accuracy of these predictions. In visual perception, the motions of both the observer and the objects in the scene structure the dynamics of sensory signals, so that future signals can be partially predicted from past ones. Here, we propose a self-supervised representation-learning framework that extracts and exploits the regularities of natural videos to compute accurate predictions. We motivate its polar architecture by appealing to the Fourier shift theorem and its group-theoretic generalization, and we optimize its parameters for next-frame prediction. Through controlled experiments, we demonstrate that this approach can discover representations of simple transformation groups acting on the data. When trained on natural video datasets, our framework achieves better prediction performance than traditional motion compensation and rivals conventional deep networks, while remaining interpretable and fast. Furthermore, the polar computations can be restructured into components resembling normalized simple and direction-selective complex cell models of primate V1 neurons. Thus, polar prediction offers a principled framework for understanding how the visual system represents sensory inputs in a form that simplifies temporal prediction.
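The appeal to the Fourier shift theorem can be illustrated with a minimal sketch. It assumes a fixed Fourier basis and a one-dimensional signal undergoing circular translation, whereas the framework described above learns its representation from video; the function name `polar_predict` and the parameter `eps` are hypothetical and chosen only for this illustration. Under translation, each Fourier coefficient keeps its amplitude and advances its phase at a constant rate, so the next frame can be extrapolated by repeating the most recent phase advance.

```python
import numpy as np

def polar_predict(prev, curr, eps=1e-12):
    """Extrapolate the next 1-D frame by advancing each Fourier phase.

    Illustrative only: uses a fixed FFT basis, not a learned transform.
    """
    z_prev = np.fft.fft(prev)
    z_curr = np.fft.fft(curr)
    # Per-frequency phase advance between the two frames, as a unit-modulus number.
    step = z_curr * np.conj(z_prev)
    step = step / (np.abs(step) + eps)
    # Keep the current amplitude and advance the phase by one more step.
    z_next = z_curr * step
    return np.fft.ifft(z_next).real

# Toy "video": a random signal circularly shifted by 3 samples per frame.
rng = np.random.default_rng(0)
signal = rng.standard_normal(64)
frames = [np.roll(signal, 3 * t) for t in range(3)]

pred = polar_predict(frames[0], frames[1])
error = np.max(np.abs(pred - frames[2]))          # phase extrapolation
copy_error = np.max(np.abs(frames[1] - frames[2]))  # baseline: repeat last frame
```

For this idealized case the extrapolation matches the true third frame up to floating-point error, while simply copying the last frame does not. Natural videos are not global translations, which is why a learned, local representation is needed in place of the fixed Fourier basis assumed here.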