Human vision is foveated, with resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye movements to bring different parts of the world into focus while keeping the rest in peripheral context. In contrast, most computer vision systems encode the visual world at uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI), based on the human retina and primary visual cortex, that reformats a variable-resolution, retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at https://github.com/nblauch/fovi and https://huggingface.co/fovi-pytorch.
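To make the kNN-convolution idea concrete, below is a minimal sketch of convolving over k-nearest-neighborhoods on an irregular sensor manifold. This is an illustration of the general operation only, not the paper's implementation: the shapes, variable names, and especially the assignment of kernel weights by distance rank (rather than the paper's kernel mapping technique) are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N sensor nodes on a 2D manifold, each with a C-dim feature.
N, C, C_out, k = 64, 8, 16, 5
coords = rng.standard_normal((N, 2))   # node positions on the sensor manifold
feats = rng.standard_normal((N, C))    # per-node feature vectors

# k-nearest-neighbor indices per node (self included), by Euclidean distance.
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
knn = np.argsort(d2, axis=1)[:, :k]    # (N, k)

# Shared kernel: one weight slice per neighbor slot.
# NOTE: assigning slots by distance rank is an assumption; the paper instead
# maps kernel weights to neighbors via its kernel mapping technique.
W = rng.standard_normal((k, C, C_out)) * 0.1

# kNN-convolution: gather each node's neighborhood, contract with the kernel.
gathered = feats[knn]                        # (N, k, C)
out = np.einsum("nkc,kco->no", gathered, W)  # (N, C_out)
print(out.shape)                             # (64, 16)
```

Because the neighborhoods are precomputed index sets, the gather-and-contract step is a single batched operation, which is what lets the irregular, variable-resolution array be processed with the efficiency of a dense convolution.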