CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems provide only simple answers, without indicating where in the image the answer is located. In addition, vision-language (ViL) embedding remains underexplored for these tasks. A surgical Visual Question Localized-Answering (VQLA) system would therefore help medical students and junior surgeons learn from recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed to a standard Data-Efficient Image Transformer (DeiT) module, followed by a parallel classifier and detector for joint prediction. We validate the model experimentally on public surgical videos from the MICCAI EndoVis Challenge 2017 and 2018. The experimental results show that our proposed model outperforms state-of-the-art approaches in both accuracy and robustness. Ablation studies further confirm the contribution of each proposed component. The proposed method provides a promising solution for surgical scene understanding and is a first step toward Artificial Intelligence (AI)-based VQLA systems for surgical training. Our code is publicly available.
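The data flow described above (fuse visual and textual features through a gate, then predict an answer and a location in parallel) can be illustrated with a minimal sketch. It is not the authors' implementation: the co-attention layers and the DeiT module are omitted, the gate is a generic sigmoid gate over the concatenated features, and all names (`gated_fusion`, `predict`), the weight shapes, and the four-value normalized box output are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, used for the gate and for squashing box values."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x, weights, bias):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(visual, text, gate_w, gate_b):
    """Generic gated fusion of two equal-length feature vectors (assumed form).

    gate_i  = sigmoid(gate_w[i] . [visual; text] + gate_b[i])
    fused_i = gate_i * visual_i + (1 - gate_i) * text_i
    """
    if len(visual) != len(text):
        raise ValueError("visual and text features must have the same length")
    joint = list(visual) + list(text)  # concatenate both modalities
    gates = [sigmoid(z) for z in linear(joint, gate_w, gate_b)]
    return [g * v + (1.0 - g) * t for g, v, t in zip(gates, visual, text)]


def predict(fused, cls_w, cls_b, box_w, box_b):
    """Parallel heads on the fused features (hypothetical layout).

    Returns the index of the highest-scoring answer class and four
    box values squashed into [0, 1].
    """
    logits = linear(fused, cls_w, cls_b)
    answer = max(range(len(logits)), key=logits.__getitem__)
    box = [sigmoid(z) for z in linear(fused, box_w, box_b)]
    return answer, box
```

In a full model, the fused vector would pass through the Transformer module before reaching the two heads; here the heads read the fused features directly to keep the sketch short.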
