Brain-related research topics in artificial intelligence have recently gained popularity, particularly due to the expansion of what multimodal architectures can do, from computer vision to natural language processing. Our main goal in this work is to explore the possibilities and limitations of these architectures for decoding spoken text from non-invasive fMRI recordings. Contrary to vision and textual data, fMRI data constitute a complex modality due to the variety of brain scanners, which implies (i) a variety of recorded signal formats, (ii) low resolution and noise in the raw signals, and (iii) a scarcity of pretrained models that can be leveraged as foundation models for generative learning. These points make the problem of non-invasive decoding of text from fMRI recordings very challenging. In this paper, we propose an end-to-end multimodal LLM for decoding spoken text from fMRI signals. The proposed architecture is founded on (i) an encoder derived from a specific transformer, incorporating an augmented embedding layer and an attention mechanism better adjusted than those present in the state of the art, and (ii) a frozen large language model adapted to align the embedding of the input text with the encoded embedding of brain activity in order to decode the output text. A benchmark is performed on a corpus consisting of a set of human-human and human-robot interactions in which fMRI and conversational signals are recorded synchronously. The obtained results are very promising, as our proposal outperforms the evaluated models and is able to generate text that more accurately captures the semantics present in the ground truth. The implementation code is available at https://github.com/Hmamouche/brain_decode.
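The architecture described above can be sketched at a high level: an fMRI encoder (a transformer with an augmented embedding layer over voxel time series) whose outputs are projected into the input embedding space of a frozen LLM. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern; all module names, dimensions, and the simple linear alignment layer are assumptions for clarity, not the authors' actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    """Illustrative fMRI encoder: voxel projection + positional
    embedding ("augmented" embedding layer), then a transformer."""
    def __init__(self, n_voxels=1024, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.voxel_proj = nn.Linear(n_voxels, d_model)
        self.pos_emb = nn.Embedding(512, d_model)  # learned positions over time steps
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):  # x: (batch, time, n_voxels)
        t = torch.arange(x.size(1), device=x.device)
        h = self.voxel_proj(x) + self.pos_emb(t)
        return self.encoder(h)  # (batch, time, d_model)

class BrainToTextModel(nn.Module):
    """Encoder output aligned with a frozen LLM's embedding space.
    The LLM itself is omitted; its parameters would stay frozen while
    the encoder and alignment layer are trained."""
    def __init__(self, llm_dim=768):
        super().__init__()
        self.encoder = FMRIEncoder()
        self.align = nn.Linear(256, llm_dim)

    def forward(self, fmri):
        # Returns brain-derived "prefix" embeddings to feed the frozen LLM,
        # alongside the embeddings of the input text.
        return self.align(self.encoder(fmri))

model = BrainToTextModel()
out = model(torch.randn(2, 10, 1024))
print(out.shape)  # torch.Size([2, 10, 768])
```

In this kind of design, only the encoder and the alignment projection carry trainable parameters, which keeps training cost low while reusing the frozen LLM's language generation capabilities.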