Visual question answering (VQA) can play a crucial role in robotic-assisted surgical education. In practice, trainees' needs evolve constantly: they learn more types of surgery, adapt to different robots, and pick up new instruments and techniques for a given procedure. A surgical VQA system must therefore be updated continually from a sequential data stream drawn from multiple sources so that it can address new tasks. In surgical settings, storage cost and patient data privacy often restrict access to old data when the model is updated, which necessitates an exemplar-free continual learning (CL) setup. However, prior studies have overlooked two vital problems in the surgical domain: i) large domain shifts across diverse surgical operations collected from multiple departments or clinical centers, and ii) severe data imbalance arising from the uneven presence of surgical instruments or activities during procedures. This paper addresses both problems with a multimodal large language model (LLM) and an adaptive weight assignment methodology. We first develop a new multi-teacher CL framework that uses a multimodal LLM as an additional teacher. The strong generalization ability of the LLM can bridge the knowledge gap when domain shifts and data imbalances occur. We then propose a novel data processing method that transforms complex LLM embeddings into logits compatible with our CL framework. We further design an adaptive weight assignment approach that balances the generalization ability of the LLM against the domain expertise of the old CL model. We also construct a new dataset for surgical VQA tasks, providing a data resource for future research. Extensive experiments on three datasets demonstrate that our method outperforms other advanced CL models.
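To make the multi-teacher idea concrete, the following is a minimal sketch of a distillation loss in which a student is guided by two teachers: the old CL model and an LLM teacher whose outputs have already been mapped to logits over the answer classes. The abstract does not specify the weighting rule or the embedding-to-logit transformation, so everything here is an assumption for illustration: the function names, the temperature value, and in particular the entropy-based confidence rule used to set the teacher weights are hypothetical and are not the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), with a small epsilon to avoid log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def adaptive_teacher_weights(old_probs, llm_probs):
    """Hypothetical weighting rule (an assumption, not from the paper).

    Each teacher's confidence is 1 minus its normalized entropy, and the
    two confidences are normalized to sum to 1. Requires >= 2 classes.
    """
    max_entropy = math.log(len(old_probs))
    old_conf = max(0.0, 1.0 - entropy(old_probs) / max_entropy)
    llm_conf = max(0.0, 1.0 - entropy(llm_probs) / max_entropy)
    total = old_conf + llm_conf
    if total == 0.0:
        # Both teachers are maximally uncertain: fall back to equal weights.
        return 0.5, 0.5
    w_old = old_conf / total
    return w_old, 1.0 - w_old


def multi_teacher_distillation_loss(student_logits, old_logits, llm_logits,
                                    temperature=2.0):
    """Weighted sum of KL(teacher || student) over the two teachers."""
    student = softmax(student_logits, temperature)
    old = softmax(old_logits, temperature)
    llm = softmax(llm_logits, temperature)
    w_old, w_llm = adaptive_teacher_weights(old, llm)
    loss = w_old * kl_divergence(old, student) + w_llm * kl_divergence(llm, student)
    return loss, (w_old, w_llm)


# Usage: three answer classes; the old model is confident, the LLM less so.
loss, weights = multi_teacher_distillation_loss(
    student_logits=[0.5, 1.0, 0.2],
    old_logits=[4.0, 0.5, 0.1],
    llm_logits=[1.0, 0.9, 0.8],
)
print(f"loss={loss:.4f}, weights(old, llm)={weights}")
```

In this sketch the more confident teacher receives the larger weight for a given sample, which is one simple way to trade off the LLM's generalization against the old model's domain expertise. In a real training loop this term would typically be added to a supervised loss on the new task's labels.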