This paper shows that every image can be understood as a first-order norm+linear autoregressive process, referred to as FINOLA, where norm+linear denotes applying normalization before a linear model. We demonstrate that a 256$\times$256 image can be reconstructed from a compressed vector by autoregressively generating a 16$\times$16 feature map, followed by upsampling and convolution. This discovery sheds light on the underlying partial differential equations (PDEs) governing the latent feature space. Additionally, we investigate FINOLA for self-supervised learning through a simple masked prediction technique: by encoding a single unmasked quadrant block, we autoregressively predict the surrounding masked region. Remarkably, the pre-trained representation proves effective for image classification and object detection, even in lightweight networks, without requiring fine-tuning. The code will be made publicly available.
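To make the idea of a first-order norm+linear autoregression concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. Several details are assumptions made for illustration only: the residual update `z + W @ normalize(z)`, normalization across channels, separate matrices `A` (horizontal step) and `B` (vertical step), and averaging the two predictions at interior positions. The random weights stand in for learned parameters, and the encoder that produces the compressed vector `q`, as well as the upsampling and convolution stage, are omitted.

```python
import math
import random


def normalize(z, eps=1e-5):
    """Standardize a feature vector across its channels (assumed form of 'norm')."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale for v in z]


def step(z, weight):
    """One first-order step: normalize, apply a linear map, add to the input.

    The residual form is an assumption of this sketch.
    """
    n = normalize(z)
    return [zi + sum(w * nj for w, nj in zip(row, n)) for zi, row in zip(z, weight)]


def finola_grid(q, A, B, height=16, width=16):
    """Fill a height x width feature map autoregressively from the vector q.

    q is placed at the top-left position. Each other position is predicted
    from its left neighbor (matrix A) and/or its upper neighbor (matrix B);
    when both exist, the two predictions are averaged (an assumption).
    """
    grid = [[None] * width for _ in range(height)]
    grid[0][0] = list(q)
    for y in range(height):
        for x in range(width):
            if x == 0 and y == 0:
                continue
            candidates = []
            if x > 0:
                candidates.append(step(grid[y][x - 1], A))
            if y > 0:
                candidates.append(step(grid[y - 1][x], B))
            grid[y][x] = [sum(vals) / len(candidates) for vals in zip(*candidates)]
    return grid


def random_matrix(channels, rng, scale=0.05):
    """Hypothetical stand-in for a learned channels x channels matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(channels)] for _ in range(channels)]


rng = random.Random(0)
C = 8  # small channel count chosen only to keep the example fast
q = [rng.gauss(0.0, 1.0) for _ in range(C)]
A = random_matrix(C, rng)
B = random_matrix(C, rng)
feature_map = finola_grid(q, A, B)
print(len(feature_map), len(feature_map[0]), len(feature_map[0][0]))  # 16 16 8
```

The sketch illustrates why the process is called first-order: each feature vector depends only on its immediate predecessors, so the entire 16$\times$16 map is determined by `q` and the two shared matrices.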