This study proposes a retinal prosthetic simulation framework driven by visual fixations, inspired by the saccade mechanism, and assesses the performance gains from end-to-end optimization on a classification task. Salient patches are predicted from input images using the self-attention map of a vision transformer to mimic visual fixations. These patches are then encoded by a trainable U-Net and passed through the pulse2percept simulation framework to predict visual percepts. By incorporating a learnable encoder, we aim to optimize the visual information transmitted to the retinal implant, addressing both the limited resolution of the electrode array and the distortion between the input stimuli and the resulting phosphenes. The predicted percepts are evaluated with the self-supervised DINOv2 foundation model, optionally followed by a learnable linear layer for classification. On a subset of the ImageNet validation set, the fixation-based framework, using simulation parameters based on a real subject's physiological data, achieves a classification accuracy of 87.72%, significantly outperforming the downsampling baseline (40.59%) and approaching the healthy upper bound of 92.76%. Our approach shows promise for producing more semantically interpretable percepts within the limited resolution of current retinal prosthetics.
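The fixation-selection step described above can be sketched as follows. This is a minimal illustrative version, not the paper's exact implementation: it assumes the ViT self-attention map has already been upsampled to image resolution, scores non-overlapping grid cells by mean attention, and crops the top-scoring cells as fixation patches; the function name and grid-pooling heuristic are ours.

```python
import numpy as np

def select_fixation_patches(image, attn, num_fixations=3, patch_size=64):
    """Crop the `num_fixations` most salient patches from `image`.

    `attn` is a 2-D saliency map at image resolution (e.g. a ViT
    self-attention map, upsampled). The image is tiled into a
    non-overlapping grid; each cell is scored by its mean attention,
    and the highest-scoring cells are returned as patches.
    """
    h, w = attn.shape
    gh, gw = h // patch_size, w // patch_size
    # Mean attention per grid cell (crop to a multiple of patch_size).
    scores = attn[:gh * patch_size, :gw * patch_size] \
        .reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    # Flattened cell indices, most salient first.
    order = np.argsort(scores, axis=None)[::-1][:num_fixations]
    patches = []
    for idx in order:
        r, c = np.unravel_index(idx, scores.shape)
        y, x = r * patch_size, c * patch_size
        patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches
```

In the full pipeline, each returned patch would then be fed to the trainable U-Net encoder and the resulting stimulus simulated with pulse2percept.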