Human vision is foveated: resolution varies across a large field of view and peaks at its center. This reflects an efficient trade-off for active sensing, allowing eye movements to bring different parts of the world into focus while keeping the rest in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, which makes it challenging to process full-field, high-resolution images efficiently. We propose a foveated vision interface (FOVI), based on the human retina and primary visual cortex (V1), that reformats a variable-resolution, retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the DINOv3 ViT foundation model, leveraging low-rank adaptation (LoRA). These models provide competitive performance with a fraction of the pixels and computational cost of full-resolution, non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code (https://github.com/nblauch/fovi) and pre-trained models (https://huggingface.co/fovi-pytorch) are available.
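To make the idea of kNN receptive fields on a sensor manifold concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the FOVI implementation or its API. All function names (`foveated_sensors`, `knn_offsets`, `knn_conv`) and parameters are hypothetical. A log-polar warp stands in for the V1-like manifold, and a nearest-cell lookup into a 3x3 reference kernel stands in for the kernel mapping technique, whose details the abstract does not give.

```python
import numpy as np

def foveated_sensors(n_rings=12, n_angles=24, a=0.5, r_max=10.0):
    """Return retinal (x, y) positions and warped manifold coordinates (u, v)."""
    # Assumption: equal steps in w = log(1 + r/a) pack rings densely near the fovea.
    w = np.linspace(0.0, np.log1p(r_max / a), n_rings + 1)[1:]
    r = a * np.expm1(w)
    theta = np.arange(n_angles) * (2 * np.pi / n_angles)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    retina = np.stack([rr * np.cos(tt), rr * np.sin(tt)], axis=-1).reshape(-1, 2)
    # In (ring index, angle index) units the same sensors are uniformly spaced.
    u, v = np.meshgrid(np.arange(n_rings, dtype=float),
                       np.arange(n_angles, dtype=float), indexing="ij")
    manifold = np.stack([u, v], axis=-1).reshape(-1, 2)
    return retina, manifold

def knn_offsets(manifold, n_angles, k=9):
    """Indices of the k nearest sensors and their offsets on the manifold."""
    diff = manifold[None, :, :] - manifold[:, None, :]  # diff[i, j] = pos_j - pos_i
    # The angular axis is periodic, so wrap its offsets into [-n/2, n/2).
    diff[..., 1] = (diff[..., 1] + n_angles / 2) % n_angles - n_angles / 2
    dist = np.linalg.norm(diff, axis=-1)
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]         # (N, k)
    offsets = np.take_along_axis(diff, idx[:, :, None], axis=1)  # (N, k, 2)
    return idx, offsets

def knn_conv(x, idx, offsets, weight):
    """Apply a 3x3 reference kernel (3, 3, C_in, C_out) to kNN neighborhoods."""
    # Illustrative mapping: each neighbor uses the kernel cell nearest its offset.
    cell = np.clip(np.rint(offsets), -1, 1).astype(int) + 1  # values in {0, 1, 2}
    w_map = weight[cell[..., 0], cell[..., 1]]               # (N, k, C_in, C_out)
    return np.einsum("nkc,nkcd->nd", x[idx], w_map)

retina, manifold = foveated_sensors()
idx, offsets = knn_offsets(manifold, n_angles=24)
rng = np.random.default_rng(0)
x = rng.normal(size=(len(manifold), 3))   # one 3-channel value per sensor
weight = rng.normal(size=(3, 3, 3, 4))    # 3 input channels, 4 output channels
y = knn_conv(x, idx, offsets, weight)
print(y.shape)  # (288, 4)
```

In this toy layout the sensors are unevenly spaced in retinal coordinates but form a regular grid on the warped manifold, so a single shared kernel can be applied at every eccentricity. A real sensor array would generally be irregular, which is where a kNN neighborhood becomes more useful than a fixed grid stencil.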