SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing and rapidly expanding its use across domains. Using unidirectional attention, these autoregressive LLMs can generate long, coherent paragraphs. However, visual question answering (VQA) tasks, which require both vision and language processing, typically rely on models with bidirectional attention or on fusion techniques to capture the context of multiple modalities at once. Because GPT does not natively process vision tokens, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that extends the GPT2 model to accept vision input (images), so that advances in GPT models can be exploited for VQA in robotic surgery. The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embeddings (token type and pose). Given that GPT models attend unidirectionally yet can generate coherent long paragraphs, we carefully sequence the word tokens before the vision tokens, mimicking the human thought process of understanding the question before inferring an answer from an image. Quantitatively, we show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets (based on endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets with question types to allow sub-type analysis. Finally, we extensively study and present the effects of token sequencing and of token type and pose embeddings for vision tokens in the LV-GPT model.
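
The effect of placing word tokens before vision tokens under unidirectional attention can be illustrated with a small sketch. This is not the authors' implementation: the token-type ids (`WORD`, `VISION`), the running position index, and the helper names are hypothetical, chosen only to show that with a causal mask every vision token can attend to the whole question only when the question comes first.

```python
from typing import List, Tuple

# Hypothetical token-type ids, used only for this illustration.
WORD, VISION = 0, 1


def build_sequence(
    num_word: int, num_vision: int, words_first: bool = True
) -> Tuple[List[int], List[int]]:
    """Return (token_type_ids, position_ids) for a joint word/vision sequence.

    Assumption: positions are a simple running index over the joint sequence.
    """
    word_types = [WORD] * num_word
    vision_types = [VISION] * num_vision
    if words_first:
        token_type_ids = word_types + vision_types
    else:
        token_type_ids = vision_types + word_types
    position_ids = list(range(len(token_type_ids)))
    return token_type_ids, position_ids


def causal_mask(n: int) -> List[List[bool]]:
    """Unidirectional mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def vision_tokens_seeing_full_question(token_type_ids: List[int]) -> int:
    """Count vision tokens that can attend to every word token under the causal mask."""
    mask = causal_mask(len(token_type_ids))
    word_positions = [i for i, t in enumerate(token_type_ids) if t == WORD]
    count = 0
    for i, t in enumerate(token_type_ids):
        if t == VISION and all(mask[i][j] for j in word_positions):
            count += 1
    return count


if __name__ == "__main__":
    for words_first in (True, False):
        types, _ = build_sequence(num_word=3, num_vision=2, words_first=words_first)
        seen = vision_tokens_seeing_full_question(types)
        print(f"words_first={words_first}: {seen} vision tokens see the full question")
```

With three word tokens and two vision tokens, both vision tokens can attend to the full question when words come first, and neither can when the order is reversed. In a real model these ids would index learned embedding tables that are added to the word and vision features before the transformer blocks.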
