This study proposes a retinal prosthetic simulation framework driven by visual fixations, inspired by the saccade mechanism, and assesses performance gains through end-to-end optimization on a classification task. To mimic visual fixations, salient patches are predicted from input images using the self-attention map of a vision transformer. These patches are then encoded by a trainable U-Net and passed through the pulse2percept framework to simulate the resulting visual percepts. By incorporating a learnable encoder, we aim to optimize the visual information transmitted to the retinal implant, addressing both the limited resolution of the electrode array and the distortion between the input stimuli and the elicited phosphenes. The predicted percepts are evaluated with the self-supervised DINOv2 foundation model, optionally followed by a learnable linear head for classification. Using simulation parameters fitted to a real subject's physiological data, the fixation-based framework achieves 87.72% classification accuracy on a subset of the ImageNet validation set, substantially outperforming the 40.59% accuracy of a downsampling baseline and approaching the healthy upper bound of 92.76%. Our approach shows promise for producing more semantically interpretable percepts within the limited resolution available to retinal prosthetics.
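The fixation step described above can be sketched as a toy routine: given a ViT's [CLS]-token attention over the patch grid, pick the k most-attended patches and return their pixel centers as simulated fixation points. This is a minimal illustration, not the paper's implementation; the function name, the assumption of a single pre-averaged attention vector, and the grid/patch sizes are all hypothetical.

```python
import numpy as np

def select_fixation_patches(cls_attn, grid_size, patch_size, k=3):
    """Pick the k most-attended patches as simulated fixation points.

    cls_attn: 1-D array of [CLS]-token attention over grid_size**2 patches
              (hypothetical input; a real ViT would average this over heads).
    Returns an array of (row, col) pixel coordinates of the top-k patch centers.
    """
    assert cls_attn.shape == (grid_size * grid_size,)
    top = np.argsort(cls_attn)[::-1][:k]           # indices of the k strongest patches
    rows, cols = np.divmod(top, grid_size)          # flat index -> patch-grid coordinates
    # Convert patch-grid coordinates to pixel centers for cropping.
    centers = np.stack([rows, cols], axis=1) * patch_size + patch_size // 2
    return centers
```

For a standard 224x224 input with 16x16 patches (a 14x14 grid), each returned center marks where a fixation crop would be taken before encoding and percept simulation.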