Brain-IT-VQA: From Brain Signals to Answers

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
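As a rough illustration of why controlled question categories make evaluation more interpretable, the sketch below aggregates answer accuracy per category and then averages across categories. It is a minimal example under stated assumptions, not the benchmark's actual scoring code: the category names (`color`, `count`), the record layout, and the case-insensitive exact-match criterion are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Compute accuracy separately for each question category.

    `records` is an iterable of (category, predicted_answer, true_answer)
    tuples. This layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, truth in records:
        total[category] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if predicted.strip().lower() == truth.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def macro_accuracy(per_category):
    """Average the per-category accuracies so each category counts equally."""
    return sum(per_category.values()) / len(per_category)


# Hypothetical records; the category names are illustrative only.
records = [
    ("color", "red", "Red"),
    ("color", "blue", "green"),
    ("count", "2", "2"),
]
per_category = per_category_accuracy(records)
overall = macro_accuracy(per_category)
```

Reporting accuracy per category, rather than a single pooled score, keeps a frequently asked question type from dominating the result and makes it possible to compare which kinds of information are decoded more or less reliably.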
