Accurately predicting the responses of visual cortex neurons to natural visual stimuli remains a challenge in computational neuroscience. In this work, we introduce V1T, a novel Vision Transformer-based architecture that learns a shared visual and behavioral representation across animals. We evaluate our model on two large datasets recorded from mouse primary visual cortex, where it outperforms previous convolution-based models by more than 12.7% in prediction performance. Moreover, we show that the self-attention weights learned by the Transformer correlate with the population receptive fields. Our model thus sets a new benchmark for neural response prediction and can be used jointly with behavioral and neural recordings to reveal meaningful characteristic features of the visual cortex.